Implement standard script declaration#69
Todo:
|
daniellemccool
left a comment
Okay, so I think there's at least one big thing that we have to deal with first before I can know how to comment better on any of the rest of it, because I can't know how it will look when it's done. Right now the most common use case we have is researchers who are doing multi-platform studies. The current release build system handles it in that way, looping over the included platforms with VITE_PLATFORM.
Right now, if I wanted to switch to using port_config.json, I'd have to store the generated configs somewhere, like: pnpm generate-config instagram, then put it somewhere, then pnpm generate-config facebook, then put it somewhere, then somehow put it back into packages/python/port/port_config.json, then VITE_PLATFORM=Instagram pnpm run build.
(Relatedly, if VITE_PLATFORM is set, your validator never runs. From the script builder, the bundle has platform: void 0 so it works, but if VITE_PLATFORM is set, _read_port_config() never runs.)
|
All right @daniellemccool :) I replied to most of your comments and hopefully addressed them all. I am still struggling a bit with the ADRs: I amended AD0005, probably not in the way we should, but it's going to be revised anyway. I put AD0012 in the "ferry" format, and I added AD0013 to explain how to work with the configs.
7690d1c to
cb67cc9
daniellemccool
left a comment
Flow
Help me make sure I understand the flow:
When we run the release, it sets VITE_PLATFORM to e.g. instagram, which becomes part of the bundle
worker_engine.ts gets the platform name
py_worker.js forwards to Python
main.start(sessionID, "instagram")
script.process(sessionId, "instagram") -> _check_platform_config("instagram") -> validate_or_raise("instagram")
This is where we read the config file from configs/instagram_config.json, do validation checks, and raise FileNotFoundError if there's no config file and ValidationError if something goes wrong
Then import_module("port.platforms.instagram") -> module.process(session_id) -> InstagramFlow.extraction() -> load_port_config(EXTRACTOR_REGISTRY, "instagram") -> reads configs/instagram_config.json again and builds the list of tables
Then run_extraction
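To make the validation step in this flow concrete, here is a minimal sketch of how a `validate_or_raise`-style check could keep "no config file" and "bad config file" apart. It is not the PR's implementation: the `tables` and `extractor` keys, the `registry` and `configs_dir` parameters, and the exact checks are assumptions made for illustration.

```python
import json
from pathlib import Path


class ValidationError(Exception):
    """Raised when a platform config exists but cannot be used."""


def validate_or_raise(platform: str, registry: dict, configs_dir: Path) -> dict:
    """Load configs/<platform>_config.json and cross-check it against a registry.

    The "tables" / "extractor" keys are hypothetical; the real schema may differ.
    """
    path = configs_dir / f"{platform}_config.json"
    if not path.is_file():
        # A missing file is reported separately from an invalid one.
        raise FileNotFoundError(f"no config for platform {platform!r}: {path}")
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise ValidationError(f"{path} is not valid JSON: {exc}") from exc

    tables = config.get("tables") if isinstance(config, dict) else None
    if not isinstance(tables, list):
        raise ValidationError(f"{path}: expected a 'tables' list")

    # Every table must point at an extractor that actually exists in the code.
    for table in tables:
        name = table.get("extractor") if isinstance(table, dict) else None
        if name not in registry:
            raise ValidationError(f"{path}: unknown extractor {name!r}")
    return config
```

With this shape, a caller can treat `FileNotFoundError` as "run the generator first" and `ValidationError` as "the config and the code have drifted apart".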
Build time
At build time, there's dev with pnpm start, which checks that VITE_PLATFORM is set and that the config for the platform exists, then runs the dev server
Then there's pnpm release for prod, and there, if we have VITE_PLATFORM set, we build just that one; otherwise we glob configs/*_config.json and iterate over them, and then for each platform VITE_PLATFORM = ... pnpm run build -> zip
Yes?
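The release-time platform selection described above can be sketched as follows. This is a Python illustration of the logic only, not the release script itself; the function name `platforms_to_build` and the `env` parameter are hypothetical, and the `*_config.json` naming is taken from the description above.

```python
import os
from pathlib import Path

SUFFIX = "_config.json"


def platforms_to_build(configs_dir: Path, env=None) -> list[str]:
    """Return the platforms a release run would build.

    If VITE_PLATFORM is set, only that platform is built; otherwise every
    configs/<platform>_config.json file contributes one platform.
    """
    env = os.environ if env is None else env
    selected = env.get("VITE_PLATFORM", "").strip()
    if selected:
        return [selected]
    # Strip the suffix to recover the platform name from each file name.
    return sorted(p.name[: -len(SUFFIX)] for p in configs_dir.glob(f"*{SUFFIX}"))
```

Each returned name would then be used for one `VITE_PLATFORM=<name> pnpm run build` followed by the zip step.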
ADRs
I reworked the ADRs a bit to try to get them more aligned with the why rather than the how. Next week, I'm going to redo all the ADRs to make them simpler and shorter, but I split out some things here to keep each one closer to a "single decision and consequence" style.
General documentation, etc. before it gets to master
- We have to revisit the architecture documentation. I'm happy to do it because I think other things have changed as well
- It would be nice to have a migration script for old flow -> generate a new config file
- Changelog and version bump
|
Thanks a lot for the review!! Will make the changes and submit again! |
Table metadata (titles, descriptions, column headers, visualizations) previously
lived in a large DEFAULT_TABLE_CONFIG_JSON constant at the bottom of each platform
module. That constant duplicated the extractor docstrings and was optional, so the
system could silently run with stale or incorrect config.
This commit introduces a single-source-of-truth design (ADR-0012):
- Each extractor function carries its table configuration in a `Table config::`
JSON block embedded in its docstring.
- `scripts/generate_port_config.py` reads those blocks and assembles
`port_config.json`; a pnpm script `generate-config <platform>` wraps it.
- `port_config.json` is now a required build artifact. `script.py` raises at
startup if it is missing or invalid, replacing the old hardcoded PLATFORM_REGISTRY.
- New `TableConfig` dataclass and `run_extraction()` runner in `table_extractor.py`
eliminate the per-platform extraction boilerplate.
- `port_config_validator.py` cross-checks the generated file against the live
EXTRACTOR_REGISTRY and enforces schema correctness.
- All ten platform modules (instagram, facebook, linkedin, youtube, tiktok,
netflix, chatgpt, whatsapp, x, chrome) are ported to the new system.
The declarative config format is a prerequisite for external tooling: an external
application can generate or modify port_config.json to control which tables are
shown, how they are labelled, and what is extracted — without touching platform
source code. The same extractor code can then serve different study configurations
driven purely by the config file.
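As an illustration of the docstring-driven idea, the sketch below pulls a `Table config::` JSON block out of an extractor's docstring. It is not the code in `scripts/generate_port_config.py`: the helper name `table_config_from_docstring`, the example extractor `posts_to_df`, and the `title` / `headers` keys are all hypothetical, and it assumes the JSON object directly follows the marker.

```python
import inspect
import json

MARKER = "Table config::"


def table_config_from_docstring(func):
    """Return the JSON object that follows the marker in func's docstring."""
    doc = inspect.getdoc(func) or ""  # getdoc() also removes the indentation
    _, found, rest = doc.partition(MARKER)
    if not found:
        raise ValueError(f"{func.__name__} has no '{MARKER}' block")
    # raw_decode parses one JSON value and ignores any docstring text after it.
    config, _end = json.JSONDecoder().raw_decode(rest.lstrip())
    return config


def posts_to_df(zip_path):
    """Extract posts (hypothetical extractor, for illustration only).

    Table config::

        {
            "title": "Posts",
            "headers": ["date", "text"]
        }
    """
    raise NotImplementedError
```

A generator along these lines would call the helper for every function in a platform's `EXTRACTOR_REGISTRY` and write the collected objects to the config file.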
---
…rst example
Extractors are plain functions that return a DataFrame from a real DDP zip.
There was no systematic way to verify they still produce output when a platform's
export format changes.
This commit introduces a lightweight integration test harness (ADR-0004):
- The sole assertion is `not df.empty` — the only externally observable signal of
a broken extractor. Deeper diagnosis always requires manual inspection.
- `find_fixture(platform)` resolves a real DDP zip from `tests/ddp/<platform>_*.zip`.
The directory is git-ignored; real DDPs never enter version control (AD0001).
- Tests skip gracefully via `pytest.skip()` when no fixture is present, so CI
passes cleanly without real data.
- `test_extractor_integration_chatgpt.py` is the reference implementation,
covering `conversations_to_df` with a real ChatGPT DDP.
Adding coverage for a new platform requires one test file, one `ExtractorSpec`
instance, and a local DDP zip — no changes to production code.
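A rough sketch of the fixture lookup and skip behaviour described in this commit message is shown below. It assumes a `find_fixture` that returns `None` when nothing matches and a hypothetical `run_extractor_check` wrapper; the real harness and its `ExtractorSpec` type may look different.

```python
from pathlib import Path
from typing import Callable, Optional


def find_fixture(platform: str, ddp_dir: Path = Path("tests/ddp")) -> Optional[Path]:
    """Return the first <platform>_*.zip in ddp_dir, or None if there is none."""
    matches = sorted(ddp_dir.glob(f"{platform}_*.zip"))
    return matches[0] if matches else None


def run_extractor_check(extractor: Callable, platform: str,
                        ddp_dir: Path = Path("tests/ddp")) -> None:
    """Skip when no real DDP is available, otherwise require non-empty output."""
    # Imported lazily so find_fixture stays usable without pytest installed.
    import pytest

    fixture = find_fixture(platform, ddp_dir)
    if fixture is None:
        pytest.skip(f"no DDP fixture for {platform} in {ddp_dir}")
    df = extractor(fixture)
    # The single assertion: a broken extractor shows up as an empty table.
    assert not df.empty
```

A per-platform test file would then reduce to one call such as `run_extractor_check(conversations_to_df, "chatgpt")` inside a test function.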
`example.py` is a minimal, fully working platform that new contributors can copy as a starting point. It intentionally uses the simplest possible validator (accepts any zip) and a single extractor (`file_stats_to_df`) that returns a table of file-level statistics. Inline comments walk through every required piece: validator, extractor, `EXTRACTOR_REGISTRY`, and how to generate `port_config.json`. `port_config.json` is the generated config for the example platform and is checked in to the repository so the project works out of the box without requiring researchers to run the generator first. The `.gitignore` is updated accordingly to keep the file tracked.
Replace the single port_config.json with per-platform
configs/<platform>_config.json files. Key changes:
- generate_port_config.py writes to configs/<platform>_config.json;
generating one platform never touches another's file
- release.sh auto-discovers platforms by globbing configs/*.json,
removing the hardcoded platform list; VITE_PLATFORM=<p> pnpm release
builds only that one platform
- check-deps.sh enforces VITE_PLATFORM is set in dev mode and that
the matching config file exists
- script.py and main.py: platform is now always passed from VITE_PLATFORM
(no fallback to reading it from a config file)
- load_port_config() and validate() take an explicit platform argument
- port_config_validator separates FileNotFoundError from ValidationError
- pyproject.toml ships port/configs/ instead of port/port_config.json
- release.js removed; pnpm release now calls release.sh directly
- Add AD0013 documenting the standard platform module interface
- Update AD0012 and AD0005 to reflect the new file layout
only check example in
fix mistake in AD0013
9e7a492 to
452a732
Hi Danielle,
Please review this PR:
For the most notable changes see everything below. But in short:
Also notable:
Summary
This PR introduces a declarative, docstring-driven configuration system. Table metadata (titles, descriptions, column headers, visualizations) now lives as a `Table config::` JSON block inside each extractor function's docstring. A code-generation script assembles those blocks into port_config.json, which is a required build artifact consumed at runtime.
This declarative design is also a prerequisite for external tooling: an external application can generate or modify port_config.json to control which tables are shown and what is extracted, without touching platform source code.
Three commits:
scripts/generate_port_config.py, updated script.py, all 10 platform modules ported (ADR-0012).
of the box.
How it works
packages/python/port/port_config.json.
returns a list[TableConfig].
tables to the consent UI.
Integration tests
packages/python/tests/ without a DDP and confirm clean output.
used in production platform modules.
Documentation & ADRs
"Considered Options" and consequences still match).
(tests/ddp/_*.zip).
port_config.json.
Operational
platforms (instagram, facebook, etc.) are not accidentally committed.
requires.