Skip to content

Implement standard script declaration#69

Merged
trbKnl merged 7 commits into
developmentfrom
implement-standard-script-declaration
May 26, 2026
Merged

Implement standard script declaration#69
trbKnl merged 7 commits into
developmentfrom
implement-standard-script-declaration

Conversation

@trbKnl

@trbKnl trbKnl commented May 11, 2026

Copy link
Copy Markdown
Collaborator

Hi Danielle,

Please review this PR:

For the most notable changes see everything below. But in short:

  • This PR has 3 commits, one introducing the declarative way of defining a flow. checkout the ADR for the details
  • I added the integration (not unit) tests, so checkout the ADR for the details.
  • I added the example platform

Also notable:

  • I removed the poetry lock file from git. It does not add anything, and it IMO should not be checked in:
  1. it leads to conflicts
  2. does not add anything; we do not need to use a very specific versions of pandas, or pytest when running in dev mode.

Summary

This PR introduces a declarative, docstring-driven
configuration system. Table metadata (titles, descriptions, column headers, visualizations) now lives as a Table
config:: JSON block inside each extractor function's docstring. A code-generation script assembles those blocks into
port_config.json, which is a required build artifact consumed at runtime.

This declarative design is also a
prerequisite for external tooling: an external application can generate or modify port_config.json to control which
tables are shown and what is extracted — without touching platform source code.

Three commits:

  1. feat — Core infrastructure: TableConfig dataclass, run_extraction() runner, port_config_validator.py,
    scripts/generate_port_config.py, updated script.py, all 10 platform modules ported (ADR-0012).
  2. test — ExtractorSpec integration test harness with ChatGPT as the reference implementation (ADR-0004).
  3. docs — example.py as a minimal onboarding reference platform; port_config.json checked in so the project works out
    of the box.

How it works

  1. Each extractor function in a platform module carries a Table config:: JSON block in its docstring.
  2. scripts/generate_port_config.py parses those blocks (via AST, no import needed) and writes
    packages/python/port/port_config.json.
  3. At startup script.py reads and validates the file, then dispatches to the correct platform module.
  4. load_port_config(EXTRACTOR_REGISTRY) in each platform resolves extractor name strings to live callables and
    returns a list[TableConfig].
  5. run_extraction(reader, errors, config) iterates the config, calls each extractor, and returns only non-empty
    tables to the consent UI.

Integration tests

  • test_extractor_integration_chatgpt.py skips (not fails) when no fixture zip is present — run pytest
    packages/python/tests/ without a DDP and confirm clean output.
  • extractor_integration_helpers.py imports validate.validate_zip — confirm this matches the actual function signature
    used in production platform modules.
  • conftest.py inserts tests/ onto sys.path. Verify this does not conflict with installed package imports.

Documentation & ADRs

  • ADR-0012 accurately reflects the final implementation (the ADR was rewritten once during development — confirm the
    "Considered Options" and consequences still match).
  • ADR-0004: confirm the fixture path documented in the ADR matches what find_fixture() actually looks for
    (tests/ddp/_*.zip).
  • example.py step-by-step instructions — follow them literally for a toy platform and confirm they produce a working
    port_config.json.

Operational

  • port_config.json is checked in for the example platform. Confirm .gitignore is set up so generated configs for real
    platforms (instagram, facebook, etc.) are not accidentally committed.
  • pyproject.toml / poetry.lock changes — verify no unexpected dependency additions beyond what table_extractor.py
    requires.

@trbKnl

trbKnl commented May 11, 2026

Copy link
Copy Markdown
Collaborator Author

Todo:

  • Update the documentation!

@daniellemccool daniellemccool left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so I think there's at least one big thing that we have to deal with first before I can know how to comment better on any of the rest of it, because I can't know how it will look when it's done. Right now the most common use case we have is researchers who are doing multi-platform studies. The current release build system handles it in that way, looping over the included platforms with VITE_PLATFORM.

Right now if I wanted to switch to using port_config.json, I'd have to store them somewhere like pnpm generate-config instagram, then pu tit somewhere, then pnpm generate-config facebook, then put it somewhere, then somehow put it back into packages/python/port/port_config.json then VITE_PLATFORM=Instagram pnpm run build

(Relatedly, if VITE_PLATFORM is set, your validator never runs. From the script builder, the bundle has platform: void 0 so it works, but if VITE_PLATFORM is set, _read_port_config() never runs )

Comment thread scripts/gen_port_config.sh
Comment thread packages/python/port/helpers/port_config_validator.py
Comment thread packages/python/port/helpers/port_config_validator.py
Comment thread packages/python/port/script.py Outdated
Comment thread packages/python/port/helpers/port_config_validator.py Outdated
Comment thread packages/python/tests/test_extractor_integration_chatgpt.py Outdated
Comment thread packages/python/tests/test_extractor_integration_chatgpt.py Outdated
Comment thread packages/python/tests/conftest.py
@trbKnl

trbKnl commented May 20, 2026

Copy link
Copy Markdown
Collaborator Author

Allright @daniellemccool :)

I replied to most of your comments, I hopefully addressed them all.

I am still struggling a bit with the ADRs

I amended AD005 probably not in the way we should but its going to revised anyway.

I put AD0012 in the "ferry" format, and I added AD0013 to explain how to work with the configs

@trbKnl trbKnl force-pushed the implement-standard-script-declaration branch from 7690d1c to cb67cc9 Compare May 20, 2026 14:07

@daniellemccool daniellemccool left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config-format.md

Flow

Help me make sure I understand the flow:

When we run the release, it sets VITE_PLATFORM to e.g. instagram, which becomes part of the bundle
worker_engine.ts gets the platform name
py_worker.js forwards to Python
main.start(sessionID, "instagram")
script.process(sessionId, "instagram") -> _check_platform_config("instagram") -> validate_or_raise("instagram")
this is where we read the config file from configs/instagram_config.json, do validation checks, and raise FileNotFoundError if there's no config file and validationError if something goes wrong

Then import_module("post.platforms.instagram") -> module.process(session_id) -> InstagramFlow.extraction() -> load_port_config(EXTRACTOR_REGISTRY, "instagram") -> reads configs/instagram_config.json again and builds the list of tables
Then run_extraction

Build time

At build time, there's dev with pnpm start which checks VITE_PLATFORM and that the config for the platform exists then runs the dev server
Then there's pnpm release for prod, and there f we have VITE_PLATFORM set, build just that one, otherwise glob configs/*_config.json and iterate over them, and then for each platform VITE_PLATFORM = ... pnpm run build -> zip

Yes?

ADRs

I reworked the ADRs a bit to try to get them more aligned with the why rather than the how. Next week, I'm going to redo all the ADRs to make them more simple and shorter etc., but I split out some things here to try to keep things more to a "single decision and consequence" style of thing.

General documentation, etc. before it gets to master

  1. We have to revisit the architecture documentation. I'm happy to do it because I think other things have changed as well
  2. It would be nice to have a migration script for old flow -> generate a new config file
  3. Changelog and version bump

Comment thread scripts/generate_port_config.py
@trbKnl

trbKnl commented May 22, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks a lot for the review!! Will make the changes and submit again!

trbKnl added 7 commits May 26, 2026 15:59
  Table metadata (titles, descriptions, column headers, visualizations) previously
  lived in a large DEFAULT_TABLE_CONFIG_JSON constant at the bottom of each platform
  module. This was duplicated from extractor docstrings and optional, meaning the
  system could silently run with stale or incorrect config.

  This commit introduces a single-source-of-truth design (ADR-0012):

  - Each extractor function carries its table configuration in a `Table config::`
    JSON block embedded in its docstring.
  - `scripts/generate_port_config.py` reads those blocks and assembles
    `port_config.json`; a pnpm script `generate-config <platform>` wraps it.
  - `port_config.json` is now a required build artifact. `script.py` raises at
    startup if it is missing or invalid, replacing the old hardcoded PLATFORM_REGISTRY.
  - New `TableConfig` dataclass and `run_extraction()` runner in `table_extractor.py`
    eliminate the per-platform extraction boilerplate.
  - `port_config_validator.py` cross-checks the generated file against the live
    EXTRACTOR_REGISTRY and enforces schema correctness.
  - All ten platform modules (instagram, facebook, linkedin, youtube, tiktok,
    netflix, chatgpt, whatsapp, x, chrome) are ported to the new system.

  The declarative config format is a prerequisite for external tooling: an external
  application can generate or modify port_config.json to control which tables are
  shown, how they are labelled, and what is extracted — without touching platform
  source code. The same extractor code can then serve different study configurations
  driven purely by the config file.

  ---
…rst example

  Extractors are plain functions that return a DataFrame from a real DDP zip.
  There was no systematic way to verify they still produce output when a platform's
  export format changes.

  This commit introduces a lightweight integration test harness (ADR-0004):

  - The sole assertion is `not df.empty` — the only externally observable signal of
    a broken extractor. Deeper diagnosis always requires manual inspection.
  - `find_fixture(platform)` resolves a real DDP zip from `tests/ddp/<platform>_*.zip`.
    The directory is git-ignored; real DDPs never enter version control (AD0001).
  - Tests skip gracefully via `pytest.skip()` when no fixture is present, so CI
    passes cleanly without real data.
  - `test_extractor_integration_chatgpt.py` is the reference implementation,
    covering `conversations_to_df` with a real ChatGPT DDP.

  Adding coverage for a new platform requires one test file, one `ExtractorSpec`
  instance, and a local DDP zip — no changes to production code.
  `example.py` is a minimal, fully working platform that new contributors can
  copy as a starting point. It intentionally uses the simplest possible validator
  (accepts any zip) and a single extractor (`file_stats_to_df`) that returns a
  table of file-level statistics. Inline comments walk through every required
  piece: validator, extractor, `EXTRACTOR_REGISTRY`, and how to generate
  `port_config.json`.

  `port_config.json` is the generated config for the example platform and is
  checked in to the repository so the project works out of the box without
  requiring researchers to run the generator first. The `.gitignore` is updated
  accordingly to keep the file tracked.
  Replace the single port_config.json with per-platform
  configs/<platform>_config.json files. Key changes:

  - generate_port_config.py writes to configs/<platform>_config.json;
    generating one platform never touches another's file
  - release.sh auto-discovers platforms by globbing configs/*.json,
    removing the hardcoded platform list; VITE_PLATFORM=<p> pnpm release
    builds only that one platform
  - check-deps.sh enforces VITE_PLATFORM is set in dev mode and that
    the matching config file exists
  - script.py and main.py: platform is now always passed from VITE_PLATFORM
    (no fallback to reading it from a config file)
  - load_port_config() and validate() take an explicit platform argument
  - port_config_validator separates FileNotFoundError from ValidationError
  - pyproject.toml ships port/configs/ instead of port/port_config.json
  - release.js removed; pnpm release now calls release.sh directly
  - Add AD0013 documenting the standard platform module interface
  - Update AD0012 and AD0005 to reflect the new file layout

only check example in

fix mistake in AD0013
@trbKnl trbKnl force-pushed the implement-standard-script-declaration branch from 9e7a492 to 452a732 Compare May 26, 2026 14:06
@trbKnl trbKnl merged commit 5758072 into development May 26, 2026
1 check passed
@trbKnl trbKnl deleted the implement-standard-script-declaration branch May 26, 2026 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants