Implement standard script declaration by trbKnl · Pull Request #69 · d3i-infra/data-donation-task

trbKnl · 2026-05-11T11:52:58Z

Hi Danielle,

Please review this PR:

For the most notable changes, see everything below. In short:

This PR has 3 commits, one introducing the declarative way of defining a flow; check out the ADR for the details.
I added integration (not unit) tests; check out the ADR for the details.
I added the example platform.

Also notable:

I removed the poetry lock file from git. IMO it should not be checked in:

it leads to conflicts
it does not add anything; we do not need to pin very specific versions of pandas or pytest when running in dev mode.

Summary

This PR introduces a declarative, docstring-driven
configuration system. Table metadata (titles, descriptions, column headers, visualizations) now lives as a Table
config:: JSON block inside each extractor function's docstring. A code-generation script assembles those blocks into
port_config.json, which is a required build artifact consumed at runtime.

This declarative design is also a
prerequisite for external tooling: an external application can generate or modify port_config.json to control which
tables are shown and what is extracted — without touching platform source code.

Three commits:

feat — Core infrastructure: TableConfig dataclass, run_extraction() runner, port_config_validator.py,
scripts/generate_port_config.py, updated script.py, all 10 platform modules ported (ADR-0012).
test — ExtractorSpec integration test harness with ChatGPT as the reference implementation (ADR-0004).
docs — example.py as a minimal onboarding reference platform; port_config.json checked in so the project works out
of the box.

How it works

Each extractor function in a platform module carries a Table config:: JSON block in its docstring.
scripts/generate_port_config.py parses those blocks (via AST, no import needed) and writes
packages/python/port/port_config.json.
At startup script.py reads and validates the file, then dispatches to the correct platform module.
load_port_config(EXTRACTOR_REGISTRY) in each platform resolves extractor name strings to live callables and
returns a list[TableConfig].
run_extraction(reader, errors, config) iterates the config, calls each extractor, and returns only non-empty
tables to the consent UI.
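
As an illustration of the AST step above, here is a minimal sketch of reading a `Table config::` block from a docstring without importing the module. It is not the code in scripts/generate_port_config.py. It assumes that everything after the marker is a single JSON object, and the function name `extract_table_configs` and the `id` and `title` keys are hypothetical.

```python
import ast
import json
from typing import Dict

MARKER = "Table config::"  # marker named in the PR; the block layout is an assumption


def extract_table_configs(source: str) -> Dict[str, dict]:
    """Map function name -> parsed config for each docstring that has a block."""
    configs: Dict[str, dict] = {}
    tree = ast.parse(source)  # parses the text only; the module is never imported
    for node in ast.walk(tree):
        if not isinstance(node, ast.FunctionDef):
            continue
        doc = ast.get_docstring(node)  # indentation-cleaned docstring, or None
        if not doc or MARKER not in doc:
            continue
        # Assumption: the rest of the docstring after the marker is one JSON object.
        raw_json = doc.split(MARKER, 1)[1]
        configs[node.name] = json.loads(raw_json)
    return configs


# Hypothetical platform source; the config keys are made up for the example.
EXAMPLE_SOURCE = '''
def conversations_to_df(reader):
    """Extract conversations.

    Table config::

        {"id": "conversations", "title": "Your conversations"}
    """
    return None


def helper():
    """No config block here."""
'''

configs = extract_table_configs(EXAMPLE_SOURCE)
print(configs)
```

Because only the source text is parsed, a platform module with heavy or missing imports can still be scanned; the trade-off is that a malformed JSON block surfaces as a `json.JSONDecodeError` at generation time.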

Integration tests

test_extractor_integration_chatgpt.py skips (not fails) when no fixture zip is present — run pytest
packages/python/tests/ without a DDP and confirm clean output.
extractor_integration_helpers.py imports validate.validate_zip — confirm this matches the actual function signature
used in production platform modules.
conftest.py inserts tests/ onto sys.path. Verify this does not conflict with installed package imports.

Documentation & ADRs

ADR-0012 accurately reflects the final implementation (the ADR was rewritten once during development — confirm the
"Considered Options" and consequences still match).
ADR-0004: confirm the fixture path documented in the ADR matches what find_fixture() actually looks for
(tests/ddp/<platform>_*.zip).
example.py step-by-step instructions — follow them literally for a toy platform and confirm they produce a working
port_config.json.

Operational

port_config.json is checked in for the example platform. Confirm .gitignore is set up so generated configs for real
platforms (instagram, facebook, etc.) are not accidentally committed.
pyproject.toml / poetry.lock changes — verify no unexpected dependency additions beyond what table_extractor.py
requires.

trbKnl · 2026-05-11T11:58:26Z

Todo:

Update the documentation!

daniellemccool

Okay, so I think there's at least one big thing that we have to deal with first before I can know how to comment better on any of the rest of it, because I can't know how it will look when it's done. Right now the most common use case we have is researchers who are doing multi-platform studies. The current release build system handles it in that way, looping over the included platforms with VITE_PLATFORM.

Right now if I wanted to switch to using port_config.json, I'd have to store the configs somewhere: run pnpm generate-config instagram, then put it somewhere, then pnpm generate-config facebook, then put it somewhere, then somehow put each one back into packages/python/port/port_config.json, and then run VITE_PLATFORM=Instagram pnpm run build.

(Relatedly, if VITE_PLATFORM is set, your validator never runs. From the script builder, the bundle has platform: void 0 so it works, but if VITE_PLATFORM is set, _read_port_config() never runs.)

trbKnl · 2026-05-20T13:59:09Z

All right @daniellemccool :)

I replied to most of your comments; hopefully I addressed them all.

I am still struggling a bit with the ADRs.

I amended AD0005, probably not in the way we should, but it's going to be revised anyway.

I put AD0012 in the "ferry" format, and I added AD0013 to explain how to work with the configs

daniellemccool

config-format.md

Flow

Help me make sure I understand the flow:

When we run the release, it sets VITE_PLATFORM to e.g. instagram, which becomes part of the bundle
worker_engine.ts gets the platform name
py_worker.js forwards to Python
main.start(sessionID, "instagram")
script.process(sessionId, "instagram") -> _check_platform_config("instagram") -> validate_or_raise("instagram")
this is where we read the config file from configs/instagram_config.json, do validation checks, and raise FileNotFoundError if there's no config file and ValidationError if something goes wrong

Then import_module("port.platforms.instagram") -> module.process(session_id) -> InstagramFlow.extraction() -> load_port_config(EXTRACTOR_REGISTRY, "instagram") -> reads configs/instagram_config.json again and builds the list of tables
Then run_extraction

Build time

At build time, there's dev with pnpm start, which checks that VITE_PLATFORM is set and that the config for the platform exists, then runs the dev server.
Then there's pnpm release for prod: if we have VITE_PLATFORM set, build just that one; otherwise glob configs/*_config.json and iterate over them, and then for each platform VITE_PLATFORM = ... pnpm run build -> zip
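
To make the platform selection above concrete, here is a rough Python sketch of the discovery logic. The real implementation is the release.sh shell script, which is not shown in this thread, so the function name `discover_platforms` and its parameters are hypothetical; only the `configs/*_config.json` naming comes from the discussion.

```python
from pathlib import Path
from typing import List, Optional

CONFIG_SUFFIX = "_config.json"


def discover_platforms(configs_dir: Path, selected: Optional[str] = None) -> List[str]:
    """Return the platforms to build, derived from config file names."""
    if selected:
        # Single-platform build: the matching config must exist.
        config_path = configs_dir / f"{selected}{CONFIG_SUFFIX}"
        if not config_path.is_file():
            raise FileNotFoundError(f"No config for platform {selected!r}: {config_path}")
        return [selected]
    # Multi-platform build: every <platform>_config.json names one platform.
    return sorted(
        path.name[: -len(CONFIG_SUFFIX)]
        for path in configs_dir.glob(f"*{CONFIG_SUFFIX}")
    )
```

Under these assumptions, `selected` plays the role of VITE_PLATFORM: when it is given, only that platform is returned; when it is absent, the list of platforms is whatever config files exist.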

Yes?

ADRs

I reworked the ADRs a bit to try to get them more aligned with the why rather than the how. Next week, I'm going to redo all the ADRs to make them more simple and shorter etc., but I split out some things here to try to keep things more to a "single decision and consequence" style of thing.

General documentation, etc. before it gets to master

We have to revisit the architecture documentation. I'm happy to do it because I think other things have changed as well
It would be nice to have a migration script for old flow -> generate a new config file
Changelog and version bump

trbKnl · 2026-05-22T18:35:27Z

Thanks a lot for the review!! Will make the changes and submit again!

Table metadata (titles, descriptions, column headers, visualizations) previously lived in a large DEFAULT_TABLE_CONFIG_JSON constant at the bottom of each platform module. This was duplicated from extractor docstrings and optional, meaning the system could silently run with stale or incorrect config.

This commit introduces a single-source-of-truth design (ADR-0012):

- Each extractor function carries its table configuration in a `Table config::` JSON block embedded in its docstring.
- `scripts/generate_port_config.py` reads those blocks and assembles `port_config.json`; a pnpm script `generate-config <platform>` wraps it.
- `port_config.json` is now a required build artifact. `script.py` raises at startup if it is missing or invalid, replacing the old hardcoded PLATFORM_REGISTRY.
- New `TableConfig` dataclass and `run_extraction()` runner in `table_extractor.py` eliminate the per-platform extraction boilerplate.
- `port_config_validator.py` cross-checks the generated file against the live EXTRACTOR_REGISTRY and enforces schema correctness.
- All ten platform modules (instagram, facebook, linkedin, youtube, tiktok, netflix, chatgpt, whatsapp, x, chrome) are ported to the new system.

The declarative config format is a prerequisite for external tooling: an external application can generate or modify port_config.json to control which tables are shown, how they are labelled, and what is extracted, without touching platform source code. The same extractor code can then serve different study configurations driven purely by the config file.

…rst example

Extractors are plain functions that return a DataFrame from a real DDP zip. There was no systematic way to verify they still produce output when a platform's export format changes.

This commit introduces a lightweight integration test harness (ADR-0004):

- The sole assertion is `not df.empty`, the only externally observable signal of a broken extractor. Deeper diagnosis always requires manual inspection.
- `find_fixture(platform)` resolves a real DDP zip from `tests/ddp/<platform>_*.zip`. The directory is git-ignored; real DDPs never enter version control (AD0001).
- Tests skip gracefully via `pytest.skip()` when no fixture is present, so CI passes cleanly without real data.
- `test_extractor_integration_chatgpt.py` is the reference implementation, covering `conversations_to_df` with a real ChatGPT DDP.

Adding coverage for a new platform requires one test file, one `ExtractorSpec` instance, and a local DDP zip; no changes to production code are needed.
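
The fixture lookup described in this commit message could look roughly like the sketch below. It is not the helper from extractor_integration_helpers.py: the `ddp_dir` parameter is added here so the example is self-contained, and returning `None` for "no fixture" is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_fixture(platform: str, ddp_dir: Path) -> Optional[Path]:
    """Return the first <platform>_*.zip in ddp_dir, or None if there is none."""
    # Sorting makes the choice deterministic when several zips match.
    matches = sorted(ddp_dir.glob(f"{platform}_*.zip"))
    return matches[0] if matches else None
```

A test built on this would call `pytest.skip()` when the result is `None` and otherwise assert `not df.empty` on the extractor output, so a missing local DDP produces a skip rather than a failure.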

`example.py` is a minimal, fully working platform that new contributors can copy as a starting point. It intentionally uses the simplest possible validator (accepts any zip) and a single extractor (`file_stats_to_df`) that returns a table of file-level statistics. Inline comments walk through every required piece: validator, extractor, `EXTRACTOR_REGISTRY`, and how to generate `port_config.json`. `port_config.json` is the generated config for the example platform and is checked in to the repository so the project works out of the box without requiring researchers to run the generator first. The `.gitignore` is updated accordingly to keep the file tracked.

Replace the single port_config.json with per-platform configs/<platform>_config.json files.

Key changes:

- generate_port_config.py writes to configs/<platform>_config.json; generating one platform never touches another's file
- release.sh auto-discovers platforms by globbing configs/*.json, removing the hardcoded platform list; VITE_PLATFORM=<p> pnpm release builds only that one platform
- check-deps.sh enforces VITE_PLATFORM is set in dev mode and that the matching config file exists
- script.py and main.py: platform is now always passed from VITE_PLATFORM (no fallback to reading it from a config file)
- load_port_config() and validate() take an explicit platform argument
- port_config_validator separates FileNotFoundError from ValidationError
- pyproject.toml ships port/configs/ instead of port/port_config.json
- release.js removed; pnpm release now calls release.sh directly
- Add AD0013 documenting the standard platform module interface
- Update AD0012 and AD0005 to reflect the new file layout

only check example in fix mistake in AD0013
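
The split between FileNotFoundError and ValidationError can be sketched as follows. This is not port_config_validator.py itself: the config schema used here (a JSON list of objects with an `extractor` key) and the function signature are assumptions made for the example, and `ValidationError` is defined locally as a stand-in.

```python
import json
from pathlib import Path
from typing import Iterable, List


class ValidationError(Exception):
    """Stand-in for the validator's error type; defined locally for this sketch."""


def validate_or_raise(platform: str, configs_dir: Path,
                      known_extractors: Iterable[str]) -> List[dict]:
    """Load configs/<platform>_config.json and check it against known extractors."""
    config_path = configs_dir / f"{platform}_config.json"
    if not config_path.is_file():
        # A missing file is reported separately from a bad file.
        raise FileNotFoundError(f"No config file for {platform!r}: {config_path}")
    try:
        data = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValidationError(f"{config_path} is not valid JSON") from exc
    if not isinstance(data, list):
        raise ValidationError("expected a JSON list of table entries")
    known = set(known_extractors)
    for entry in data:
        # Assumed schema: each entry names its extractor under "extractor".
        name = entry.get("extractor") if isinstance(entry, dict) else None
        if name not in known:
            raise ValidationError(f"unknown extractor: {name!r}")
    return data
```

Keeping the two error types distinct lets a caller tell "the config was never generated" apart from "the config no longer matches the code", which call for different fixes.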


trbKnl requested a review from daniellemccool May 11, 2026 11:52

trbKnl assigned daniellemccool and trbKnl May 11, 2026

daniellemccool reviewed May 19, 2026


trbKnl force-pushed the implement-standard-script-declaration branch from 7690d1c to cb67cc9 Compare May 20, 2026 14:07

daniellemccool reviewed May 22, 2026


trbKnl added 7 commits May 26, 2026 15:59

implemented conftest.py

67bfde9

removed timestamp folder creation, release can be dumped in a flat fo…

ba35ff6

…lder

reworked ADRs based on Danielle's suggestions

452a732

trbKnl force-pushed the implement-standard-script-declaration branch from 9e7a492 to 452a732 Compare May 26, 2026 14:06

trbKnl merged commit 5758072 into development May 26, 2026
1 check passed

trbKnl deleted the implement-standard-script-declaration branch May 26, 2026 14:07

trbKnl mentioned this pull request Jun 12, 2026

Building multiple platforms using config file #59

Closed


2 participants