
Docs update #72

Open
trbKnl wants to merge 3 commits into development from docs-update

@trbKnl trbKnl commented Jun 11, 2026

Collaborator

Summary by Human (Niek)

  • The docs now reflect the current state of the code.
  • I cleaned up the language and made it as easy to follow as possible.
  • Cleaned up example.py and the corresponding tutorial.
  • The docs are a lot to review. @daniellemccool, I verified that the tutorials still make sense, and they do. Can you verify (with AI) whether the architecture article is to your liking? I also verified it with multiple AI passes and did a manual diff inspection.
  • When this is merged with master, check whether the docs build correctly.
  • @esvanhaeringen, can you check whether the docs make sense to you? You can build them locally and inspect them, or read the flat files.

Summary

This PR lands several interconnected changes that have been developed across multiple branches, plus updated documentation to match.

Code changes

  • Tightened archive types (SeekableBinaryReader refactor): ZipArchiveReader, validate_zip, and all platform extractors now receive
    a SeekableBinaryReader object directly from the browser upload. Zip files are never materialised to a path or Pyodide's heap (AD0007).
    This affected every platform module and the upload pipeline.

  • Docstring-driven config generation: Each extractor function now carries a Table config:: JSON block in its docstring.
    scripts/generate_port_config.py parses these via AST (no Python import needed) and writes
    packages/python/port/configs/<platform>_config.json. The generator refuses to overwrite an existing config file (AD0012, AD0014).

  • Per-platform config files and VITE_PLATFORM required in dev: release.sh now discovers platforms by globbing
    configs/*_config.json — no hardcoded platform list. VITE_PLATFORM is required in dev mode; starting without it shows an error with a
    hint to run pnpm generate-config <platform> (AD0005 amendment, AD0013).

  • Standard platform module interface: Every platform module exposes EXTRACTOR_REGISTRY, extraction(), a FlowBuilder subclass, and
    process(). script.py is now fully platform-agnostic (AD0013).

  • example.py: Added a minimal but fully working example platform as the canonical starting template for new platforms.

  • Extractor integration test framework: ExtractorSpec dataclass + conftest.py for running extractors against real DDP zips stored
    locally. Tests skip cleanly when DDP data is absent from tests/ddp/ (AD0004).

  • Removed poetry.lock: The donation task runs inside Pyodide; package versions are dictated by Pyodide, not by the local lock file.
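
To illustrate the stream-based archive handling from the first bullet, here is a minimal sketch. It assumes `io.BytesIO` as a stand-in for the `SeekableBinaryReader` supplied by the browser upload (the real class and `ZipArchiveReader` are not shown in this description) and relies on the standard-library `zipfile.ZipFile`, which accepts any seekable binary file object. The helper names `make_sample_zip` and `read_json_members` are hypothetical.

```python
import io
import json
import zipfile
from typing import BinaryIO


def make_sample_zip() -> io.BytesIO:
    """Build a small zip entirely in memory (stand-in for a browser upload)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("data/likes.json", json.dumps([{"title": "a"}, {"title": "b"}]))
        zf.writestr("readme.txt", "not json")
    buffer.seek(0)
    return buffer


def read_json_members(stream: BinaryIO) -> dict[str, object]:
    """Parse every .json member of a zip read from a seekable stream.

    No filesystem path is involved: zipfile seeks within the stream itself.
    """
    results: dict[str, object] = {}
    with zipfile.ZipFile(stream) as zf:
        for name in zf.namelist():
            if name.endswith(".json"):
                with zf.open(name) as member:
                    results[name] = json.load(member)
    return results


parsed = read_json_members(make_sample_zip())
print(sorted(parsed))
```

Because the function only ever receives a file-like object, a path string cannot be passed by accident without failing the type check.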

Documentation

  • Rewrote 04-flowbuilder.md and 05-extraction.md to reflect the current architecture (stream-based archive, config-driven extraction,
    single-platform builds).
  • Updated getting-started guide with a step-by-step platform creation workflow.
  • Updated deployment guide to document pnpm generate-config and the new multi-platform release flow.
  • Added ADRs: AD0012 (docstring config), AD0013 (platform interface), AD0014 (config overwrite policy), AD0004/testing (extractor
    integration tests).
  • Amended AD0005 to document the per-platform config discovery and VITE_PLATFORM requirement in dev.

Review Checklist

Architecture / correctness

  • SeekableBinaryReader is passed correctly all the way from py_worker.js → main.py → script.py → platform module → ZipArchiveReader /
    validate_zip — no path strings sneak through
  • script.py correctly handles the case where VITE_PLATFORM is unset or points to a platform without a config file (error visible in study
    UI)
  • run_extraction in table_extractor.py correctly filters empty tables before constructing ExtractionResult
  • The generator branch in FlowBuilder.start_flow() (isinstance(raw_result, Generator)) still works for platforms that yield intermediate
    commands during extraction (WhatsApp, Netflix)
  • Donation failure for a declined payload silently returns without showing the failure page — verify is_decline path in start_flow()
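
The generator branch named in the checklist can be pictured with the sketch below. It assumes an extraction call returns either a final result directly or a generator that yields intermediate commands and then returns the final result. Only the `isinstance(raw_result, Generator)` check comes from this PR; `drive_extraction`, `fake_ui`, and the two sample extractors are hypothetical names for illustration.

```python
from collections.abc import Generator
from typing import Any, Callable


def drive_extraction(raw_result: Any, handle_command: Callable[[Any], Any]) -> Any:
    """Return the final extraction result, draining a generator if needed."""
    if not isinstance(raw_result, Generator):
        return raw_result  # plain extractors return their result directly

    response = None
    while True:
        try:
            # send(None) on a fresh generator is equivalent to next()
            command = raw_result.send(response)
        except StopIteration as stop:
            return stop.value  # the generator's `return` value
        response = handle_command(command)


def plain_extraction() -> list[str]:
    return ["table_a"]


def interactive_extraction() -> Generator[str, str, list[str]]:
    # Yields one intermediate command, then returns the final tables.
    choice = yield "ask_user_to_pick_chat"
    return [f"table_for_{choice}"]


seen: list[str] = []


def fake_ui(command: str) -> str:
    """Record the command and answer it with a canned response."""
    seen.append(command)
    return "family_chat"


plain = drive_extraction(plain_extraction(), fake_ui)
interactive = drive_extraction(interactive_extraction(), fake_ui)
```

A review of the real `start_flow()` would check both paths in the same way: one platform that returns directly and one that yields commands first.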

Config generation

  • pnpm generate-config example produces a valid example_config.json matching the committed one
  • Running pnpm generate-config example a second time (file already exists) exits with a non-zero code and an error message — does not
    overwrite
  • release.sh picks up all platforms from configs/ and builds one zip per platform
  • VITE_PLATFORM=example pnpm start starts without error; visiting localhost:3000 shows the example platform flow
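
The two config-generation behaviours checked above (AST-based parsing and refusal to overwrite) can be sketched as follows. This is not the code of `scripts/generate_port_config.py`. The `Table config::` marker is taken from the PR text, but the exact docstring layout, the shape of the JSON, and the names `collect_table_configs` and `write_config` are assumptions.

```python
import ast
import json
import tempfile
import textwrap
from pathlib import Path

MARKER = "Table config::"  # marker named in the PR; the block layout below is assumed

SOURCE = textwrap.dedent('''
    def likes_to_df(reader):
        """Extract liked posts.

        Table config::

            {"id": "likes", "title": "Liked posts"}
        """
        return None

    def helper():
        """No config here."""
''')


def collect_table_configs(source: str) -> dict[str, dict]:
    """Map function name -> parsed JSON config, without importing the module."""
    configs: dict[str, dict] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        doc = ast.get_docstring(node) or ""
        if MARKER in doc:
            # Everything after the marker is treated as one JSON document.
            configs[node.name] = json.loads(doc.split(MARKER, 1)[1])
    return configs


def write_config(path: Path, configs: dict[str, dict]) -> None:
    """Write the config, refusing to overwrite an existing file."""
    # Mode "x" raises FileExistsError if the file is already present.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(configs, fh, indent=2)


configs = collect_table_configs(SOURCE)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example_config.json"
    write_config(target, configs)
    try:
        write_config(target, configs)  # second run must not overwrite
        overwritten = True
    except FileExistsError:
        overwritten = False
```

A command-line wrapper could catch `FileExistsError`, print a message, and call `sys.exit(1)` to produce the non-zero exit code the checklist asks for.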

Platform modules

  • All 10 platform modules (instagram, linkedin, facebook, tiktok, youtube, x, netflix, chrome, whatsapp, chatgpt) have EXTRACTOR_REGISTRY
    defined and all entries resolve to callable functions
  • Each platform's extraction() calls load_port_config(EXTRACTOR_REGISTRY, "") with the correct platform name string matching its
    config file name
  • WhatsApp and Netflix — the documented exceptions in AD0013 — still work end-to-end despite diverging signatures
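
The first check in this group could be automated along these lines. The sketch assumes `EXTRACTOR_REGISTRY` is a mapping from extractor name to function, which this description does not confirm. `unresolved_entries` and the sample registries are hypothetical.

```python
from collections.abc import Mapping


def unresolved_entries(registry: Mapping[str, object]) -> list[str]:
    """Return the names of registry entries that are not callable."""
    return sorted(name for name, fn in registry.items() if not callable(fn))


def likes_to_df(reader: object) -> list[dict]:
    """Placeholder extractor used only for this illustration."""
    return []


# Hypothetical registries: one healthy, one with a stale string entry.
GOOD_REGISTRY = {"likes_to_df": likes_to_df}
BAD_REGISTRY = {"likes_to_df": likes_to_df, "follows_to_df": "follows_to_df"}

good_problems = unresolved_entries(GOOD_REGISTRY)
bad_problems = unresolved_entries(BAD_REGISTRY)
```

Running such a check over every platform module would flag entries that were renamed or removed without updating the registry.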

Tests

  • pytest passes with no real DDP data present (integration tests skip, not fail)
  • Drop a real DDP zip into packages/python/tests/ddp/ and confirm the corresponding integration test runs and passes
  • test_zip_archive_reader.py and test_validate.py pass (these were significantly rewritten)
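
The skip-when-absent behaviour in the first check can be sketched as below. The project uses pytest with an `ExtractorSpec` dataclass and `conftest.py`, none of which are reproduced here. To stay dependency-free, this sketch uses the standard-library `unittest` skip decorator instead, which pytest also collects and reports as a skip. `find_ddp_zips`, `DDP_DIR`, and the test class are hypothetical names.

```python
import tempfile
import unittest
from pathlib import Path

# Stand-in for packages/python/tests/ddp/; an empty temporary directory here.
DDP_DIR = Path(tempfile.mkdtemp())


def find_ddp_zips(ddp_dir: Path) -> list[Path]:
    """Return the zip files available for integration tests, if any."""
    return sorted(ddp_dir.glob("*.zip")) if ddp_dir.is_dir() else []


class ExampleExtractorTest(unittest.TestCase):
    @unittest.skipUnless(find_ddp_zips(DDP_DIR), "no DDP zip present in test data directory")
    def test_extracts_tables(self) -> None:
        # With real data this would run the extractor against the zip.
        self.fail("would run the extractor against the real zip here")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleExtractorTest)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped), result.wasSuccessful())
```

The key property is that missing data produces a skip rather than a failure, so the suite stays green on machines without DDP files.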

Documentation

  • doc/build/ should not be committed — it's a build artifact. Consider adding it to .gitignore or removing it from the PR
  • AD0012 index.yaml entry title says "Declarative TableConfig for platform extraction scripts" but the ADR file title says "Docstring-driven
    UI metadata for extractor functions" — these should match
  • AD0013 typo: "folename" → "filename" in the Consequences section
  • Mermaid diagrams in 04-flowbuilder.md render correctly in the built docs (requires sphinxcontrib.mermaid to be installed in the docs
    environment — check docs.yml workflow)
  • doc/source/architecture/index.rst deletion: confirm no other .rst file references it

CI / workflow

  • .github/workflows/docs.yml builds without error (the js mock import and sphinxcontrib.mermaid extension are new — confirm the docs
    dependencies are pinned somewhere)
  • check-deps.sh is called by pnpm start — verify it doesn't block CI
