Project Context Template

This is a worked example of the output the project interview agent produces. Use it as a reference for what a completed project_context.md looks like. Do not copy this verbatim into your project — run the interview agent against your actual codebase to produce one that reflects real intent and constraints.

The content below describes a fictional "Customer Retention Analysis" pipeline, included only to illustrate the shape and level of specificity expected in each section.


Project Context — Customer Retention Analysis

Date: 2026-04-12
Interviewed: Data scientist (solo contributor)
Project stage: active development (pre-publication audit planned in ~6 weeks)

Purpose

This pipeline ingests monthly customer activity exports, joins them against a survey dataset, and produces retention estimates for an internal research report. It's not a production service — the end state is a reproducible analysis that a reviewer can re-run from raw data to final tables.

Data sources

  • activity/*.parquet — monthly activity exports from the product warehouse, one file per month, appended over time. Schema has been stable for 18 months but columns were renamed in the Jan 2025 export (documented in docs/schema_migration.md).
  • surveys/baseline.csv — one-time baseline survey collected at enrollment. ~12k respondents, ~8% missing on key covariates.
  • surveys/followup_*.csv — monthly followups. Response rates degrade over time (40% at month 1, ~18% by month 12).
  • dictionary.xlsx — maintained by the research coordinator. Authoritative source for variable definitions and allowable values.

Pipeline architecture

Numbered stages in pipelines/:

  1. 01_etl.py — loads raw files, normalizes column names per the dictionary, outputs data/interim/merged.parquet. Stable.
  2. 02_eda.py — profiling, missingness maps, distribution checks. Outputs plots to reports/eda/. Stable.
  3. 03_sample.py — applies inclusion criteria, produces the analytic sample. In progress — criteria still being finalized with the PI.
  4. 04_analysis.py — fits retention models, produces result tables and figures. Planned.

Utility modules in src/: dates.py, cleaning.py, scoring.py, io.py.

Key analytical decisions

  1. Inclusion window: participants with at least one activity record in the 90 days post-enrollment. Rationale: excludes accidental sign-ups. Encoded as config.activity_window_days.
  2. Retention definition: a participant is "retained" at month M if they have ≥1 activity event in the 14-day window centered on day 30M post-enrollment. The window (vs. a point estimate) accommodates activity burstiness.
  3. Survey linkage: inner join on participant_id for the retention models. Participants without survey responses are therefore dropped from those models but kept in descriptive tables.
  4. Missing-data handling for survey covariates: complete-case for the primary model; multiple imputation as a sensitivity analysis. Not yet implemented — flagged in open questions.
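
As an illustration of the retention definition in decision 2, here is a minimal sketch rather than the pipeline's actual code. The function name `is_retained` is hypothetical, and the window is assumed to be half-open (7 days before the center up to, but not including, 7 days after) so that it spans exactly 14 days. The source does not specify how the boundaries are treated.

```python
from datetime import date, timedelta


def is_retained(enrollment, event_dates, month, window_days=14):
    """Return True if any event falls in the window centered on day 30*month.

    Assumption: the window is half-open, [center - half, center + half),
    so it covers exactly `window_days` calendar days.
    """
    if window_days % 2 != 0:
        raise ValueError("window_days must be even to center the window")
    half = timedelta(days=window_days // 2)
    center = enrollment + timedelta(days=30 * month)
    start, end = center - half, center + half
    return any(start <= d < end for d in event_dates)


# Hypothetical participant enrolled on 2025-01-01 with two activity events.
enrolled = date(2025, 1, 1)
events = [date(2025, 1, 10), date(2025, 4, 3)]
print(is_retained(enrolled, events, month=1))  # False: window is Jan 24 to Feb 6
print(is_retained(enrolled, events, month=3))  # True: Apr 3 is within Mar 25 to Apr 7
```

In the pipeline itself the window length would come from config.yaml rather than a default argument, in line with the no-hardcoded-thresholds convention.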

Known issues and workarounds

  • 01b_etl_bypass_jan2025.py exists to handle the schema rename in the Jan 2025 activity export. Should be retired once the main ETL handles both schemas natively. No current owner for that refactor.
  • src/cleaning.py mixes genuinely shared helpers with one-off fixes for the Jan 2025 rename. The one-offs should move to the bypass script or get deleted when the bypass is retired.
  • Activity-file loader assumes monotonically increasing filenames. A file was once re-exported out of order; a row-count assertion caught it, but the assertion message was too vague to diagnose the cause quickly.
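
One possible way to make the filename-ordering assumption fail with a specific message is sketched below. The helper `check_load_order` and the filenames in the example are hypothetical; the real naming scheme of the activity exports is not given here, so the sketch only assumes that names compare correctly as strings.

```python
def check_load_order(filenames):
    """Raise ValueError naming the first pair of files that is out of order.

    Assumption: filenames are expected to be strictly increasing when
    compared as plain strings, in the order the loader reads them.
    """
    for prev, cur in zip(filenames, filenames[1:]):
        if cur <= prev:
            raise ValueError(
                f"activity files out of order: {cur!r} was loaded after {prev!r}; "
                "expected strictly increasing filenames"
            )


# Hypothetical filenames; the second list has one file out of sequence.
check_load_order(["2025-01.parquet", "2025-02.parquet", "2025-03.parquet"])
try:
    check_load_order(["2025-02.parquet", "2025-01.parquet"])
except ValueError as err:
    print(err)
```

Naming both files in the message is the point of the sketch: it tells the reader which export to inspect without re-running the stage.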

Conventions

  • Naming: snake_case throughout. Derived variables take the suffix _derived (e.g., retained_m3_derived). Source-column passthroughs keep their original names.
  • Config: config.yaml at repo root, loaded once per stage. No hardcoded thresholds in pipeline scripts — if you find one, it's a bug to file, not a convention to follow.
  • Logging: Python logging module at INFO level. Each stage logs input paths, row counts before/after each filter, and output paths. Logs are captured per-run to logs/YYYY-MM-DD_stage.log.
  • Testing: pytest for src/ utility functions. Pipeline stages are not unit-tested; they're verified by running end-to-end on a small frozen sample in tests/fixtures/.
  • Assertions: at every merge, check row counts against expected bounds and log the delta. At every filter, log N kept and N dropped.
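
The logging and assertion conventions above could be wrapped in two small helpers, sketched here with the standard library only. The names `log_filter` and `check_merge`, the logger name, and the example numbers are all hypothetical. The helpers take row counts rather than data frames so that the sketch does not assume any particular data library.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")  # hypothetical logger name


def log_filter(label, n_before, n_after):
    """Log N kept and N dropped for one filter step; return N dropped."""
    dropped = n_before - n_after
    logger.info("%s: kept %d, dropped %d (of %d)", label, n_after, dropped, n_before)
    return dropped


def check_merge(label, n_left, n_merged, min_rows, max_rows):
    """Log the row-count delta for a merge and fail if it is out of bounds."""
    delta = n_merged - n_left
    logger.info("%s: %d rows in, %d rows out (delta %+d)", label, n_left, n_merged, delta)
    if not min_rows <= n_merged <= max_rows:
        raise AssertionError(
            f"{label}: merged row count {n_merged} outside expected bounds "
            f"[{min_rows}, {max_rows}] (left input had {n_left} rows)"
        )
    return delta


# Illustrative counts only; real bounds would come from config.yaml.
log_filter("inclusion window", n_before=12000, n_after=10500)
check_merge("survey linkage", n_left=10500, n_merged=9800, min_rows=9000, max_rows=10500)
```

Putting the step label, the observed count, and the expected bounds in the failure message addresses the vague-assertion problem noted under known issues.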

Design tradeoffs

  • Config-driven over parameterized functions: analytical choices live in config.yaml, not function arguments, because the PI needs to audit every decision in one place. Tradeoff: less composable for reuse, but reuse isn't a goal.
  • Stage outputs to disk, not in-memory: each stage reads and writes parquet files. Slower than an in-memory pipeline but the researcher can inspect intermediates and re-run a single stage without re-running everything.
  • Complete-case primary analysis: simpler to explain in the report than MI as a headline result. MI is planned as a sensitivity analysis instead (not yet implemented).
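
To illustrate the config-driven tradeoff, the sketch below validates an already-parsed config mapping and freezes it so a stage cannot change an analytical choice after loading. It is an assumption-laden example, not the project's loader: reading config.yaml with a YAML library is omitted, `build_config` and the `Config` class are hypothetical, and of the two keys only `activity_window_days` is named in this document (`retention_window_days` is invented for illustration).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Config:
    activity_window_days: int   # named in this document as config.activity_window_days
    retention_window_days: int  # hypothetical key for the 14-day retention window


def build_config(raw):
    """Validate a parsed config mapping and return an immutable Config."""
    expected = {f.name for f in fields(Config)}
    provided = set(raw)
    missing = expected - provided
    unknown = provided - expected
    if missing or unknown:
        raise KeyError(
            f"config keys missing={sorted(missing)} unknown={sorted(unknown)}"
        )
    return Config(**raw)


# `raw` stands in for the mapping a YAML parser would return for config.yaml.
raw = {"activity_window_days": 90, "retention_window_days": 14}
config = build_config(raw)
print(config.activity_window_days)
```

Rejecting unknown keys as well as missing ones keeps the config file an exact record of the decisions the PI is auditing.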

Extension goals

  • Must support: swapping in a new followup survey wave without touching ETL.
  • Should support: adding a new outcome variable (e.g., engagement intensity) without restructuring the pipeline.
  • Out of scope: running this on a different study with different schemas. If that comes up, we'll fork.
  • Out of scope: packaging this for pip install. It's a pipeline, not a library.

Open questions

Things I'm uncertain about and want reviewer pressure on:

  1. Is the 14-day retention window the right choice, or should it be a single-day point estimate with a sensitivity check at ±7/±14? The current choice was pragmatic, not principled.
  2. The 03_sample.py inclusion criteria are still in flux. Reviewers: flag anywhere the inclusion logic is duplicated or implicit (e.g., if 04_analysis.py re-filters instead of trusting the analytic sample).
  3. Is src/cleaning.py salvageable or should it be split into src/schema_migration.py (Jan 2025 rename) + src/clean.py (genuine shared helpers)?
  4. I haven't pinned dependencies yet. Is requirements.txt with == pins sufficient, or should I use a lock file (pip-compile, poetry, uv)?