This is a worked example of the output the project interview agent produces. Use it as a reference for what a completed project_context.md looks like. Do not copy this verbatim into your project — run the interview agent against your actual codebase to produce one that reflects real intent and constraints.
The content below describes a fictional "Customer Retention Analysis" pipeline, included only to illustrate the shape and level of specificity expected in each section.
- Date: 2026-04-12
- Interviewed: Data scientist (solo contributor)
- Project stage: active development (pre-publication audit planned in ~6 weeks)
This pipeline ingests monthly customer activity exports, joins them against a survey dataset, and produces retention estimates for an internal research report. It's not a production service — the end state is a reproducible analysis that a reviewer can re-run from raw data to final tables.
- activity/*.parquet: monthly activity exports from the product warehouse, one file per month, appended over time. Schema has been stable for 18 months, but columns were renamed in the Jan 2025 export (documented in docs/schema_migration.md).
- surveys/baseline.csv: one-time baseline survey collected at enrollment. ~12k respondents, ~8% missing on key covariates.
- surveys/followup_*.csv: monthly followups. Response rates degrade over time (40% at month 1, ~18% by month 12).
- dictionary.xlsx: maintained by the research coordinator. Authoritative source for variable definitions and allowable values.
Numbered stages in pipelines/:
- 01_etl.py: loads raw files, normalizes column names per the dictionary, outputs data/interim/merged.parquet. Stable.
- 02_eda.py: profiling, missingness maps, distribution checks. Outputs plots to reports/eda/. Stable.
- 03_sample.py: applies inclusion criteria, produces the analytic sample. In progress: criteria still being finalized with the PI.
- 04_analysis.py: fits retention models, produces result tables and figures. Planned.
Utility modules in src/: dates.py, cleaning.py, scoring.py, io.py.
- Inclusion window: participants with at least one activity record in the 90 days post-enrollment. Rationale: excludes accidental sign-ups. Encoded as config.activity_window_days.
- Retention definition: a participant is "retained" at month M if they have ≥1 activity event in the 14-day window centered on day 30M post-enrollment. The window (vs. a point estimate) accommodates activity burstiness.
- Survey linkage: inner join on participant_id. Participants without survey responses are dropped from retention models but kept in descriptive tables.
- Missing-data handling for survey covariates: complete-case for the primary model; multiple imputation as a sensitivity analysis. Not yet implemented; flagged in open questions.
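The inclusion and retention rules above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function names, the `RETENTION_WINDOW_DAYS` constant, and the choice to treat both window endpoints as inclusive are assumptions made for the example.

```python
from datetime import date, timedelta

# Illustrative defaults; in the pipeline these would come from config.yaml.
ACTIVITY_WINDOW_DAYS = 90    # stands in for config.activity_window_days
RETENTION_WINDOW_DAYS = 14   # hypothetical config key
DAYS_PER_MONTH = 30          # "day 30M" in the retention definition


def meets_inclusion(enrollment: date, events: list[date],
                    window_days: int = ACTIVITY_WINDOW_DAYS) -> bool:
    """True if any event falls 0..window_days days after enrollment.

    Assumption: both endpoints count, including the enrollment day itself.
    """
    return any(0 <= (e - enrollment).days <= window_days for e in events)


def is_retained(enrollment: date, events: list[date], month: int,
                window_days: int = RETENTION_WINDOW_DAYS) -> bool:
    """True if any event falls in the window centered on day 30 * month.

    Assumption: the window is the center day plus or minus window_days // 2.
    """
    center = enrollment + timedelta(days=DAYS_PER_MONTH * month)
    half = timedelta(days=window_days // 2)
    return any(center - half <= e <= center + half for e in events)
```

Note that an inclusive plus-or-minus 7 days spans 15 calendar days, so whether the endpoints count is one of the details worth pinning down with the PI before the definition is frozen.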
- 01b_etl_bypass_jan2025.py exists to handle the schema rename in the Jan 2025 activity export. Should be retired once the main ETL handles both schemas natively. No current owner for that refactor.
- src/cleaning.py mixes genuinely shared helpers with one-off fixes for the Jan 2025 rename. The one-offs should move to the bypass script or get deleted when the bypass is retired.
- Activity-file loader assumes monotonically increasing filenames. A file was re-exported out of order once; discovered by a row-count assertion, but the assertion message wasn't specific enough to diagnose quickly.
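One way to make the out-of-order case fail with a diagnosable message is to check filename order directly, before any row counts are compared. The sketch below is hypothetical: `assert_load_order` is not an existing helper, and the example filenames are invented, since this document does not state the actual naming pattern. It assumes lexicographic filename order matches chronological order.

```python
def assert_load_order(names: list[str]) -> None:
    """Raise if filenames, in the order they are loaded, are not strictly increasing.

    Assumption: sorting filenames as strings reflects export month order.
    """
    for prev, cur in zip(names, names[1:]):
        if cur <= prev:
            raise AssertionError(
                f"activity file out of order: {cur!r} was loaded after {prev!r}; "
                "expected strictly increasing filenames"
            )


# Usage with invented filenames: passes silently when the order is increasing.
assert_load_order(["activity_2025-01.parquet", "activity_2025-02.parquet"])
```

Naming both offending files in the message is the point: the failure identifies which export to inspect without re-running the stage under a debugger.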
- Naming: snake_case throughout. Derived variables suffix with _derived (e.g., retained_m3_derived). Source-column passthroughs keep their original names.
- Config: config.yaml at repo root, loaded once per stage. No hardcoded thresholds in pipeline scripts; if you find one, it's a bug to file, not a convention to follow.
- Logging: Python logging module at INFO level. Each stage logs input paths, row counts before/after each filter, and output paths. Logs are captured per-run to logs/YYYY-MM-DD_stage.log.
- Testing: pytest for src/ utility functions. Pipeline stages are not unit-tested; they're verified by running end-to-end on a small frozen sample in tests/fixtures/.
- Assertions: at every merge, check row counts against expected bounds and log the delta. At every filter, log N kept and N dropped.
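The logging and assertion conventions above could be wrapped in two small helpers. This is a sketch only: `logged_filter` and `check_merge` are hypothetical names, the logger name is invented, and plain lists stand in for the data frames the real stages would use.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def logged_filter(rows: list, predicate, label: str) -> list:
    """Apply a filter and log N kept and N dropped at INFO level."""
    kept = [r for r in rows if predicate(r)]
    logger.info("filter %s: kept=%d dropped=%d",
                label, len(kept), len(rows) - len(kept))
    return kept


def check_merge(n_before: int, n_after: int,
                min_rows: int, max_rows: int, label: str) -> int:
    """Log the row-count delta of a merge and fail if it leaves expected bounds."""
    delta = n_after - n_before
    logger.info("merge %s: before=%d after=%d delta=%+d",
                label, n_before, n_after, delta)
    if not min_rows <= n_after <= max_rows:
        raise AssertionError(
            f"merge {label}: {n_after} rows outside expected bounds "
            f"[{min_rows}, {max_rows}] (before={n_before}, delta={delta:+d})"
        )
    return delta
```

In keeping with the config convention, the bounds passed to `check_merge` would be read from config.yaml rather than written into the stage script.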
- Config-driven over parameterized functions: analytical choices live in config.yaml, not function arguments, because the PI needs to audit every decision in one place. Tradeoff: less composable for reuse, but reuse isn't a goal.
- Stage outputs to disk, not in-memory: each stage reads and writes parquet files. Slower than an in-memory pipeline, but the researcher can inspect intermediates and re-run a single stage without re-running everything.
- Complete-case primary analysis: simpler to explain in the report than MI as a headline result. MI is done as a sensitivity analysis instead.
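As an illustration of the config-driven choice, a stage might turn the parsed config.yaml into a frozen object once and refuse to run if keys are missing or unexpected, so every analytical decision stays visible in one file. This is a sketch under assumptions: the `Config` class, `load_config`, and the `retention_window_days` key are hypothetical, and a plain dict stands in for the result of parsing the YAML file.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Config:
    activity_window_days: int    # referenced in this document as config.activity_window_days
    retention_window_days: int   # hypothetical key for the retention window


def load_config(raw: dict) -> Config:
    """Build an immutable Config from a parsed config mapping.

    Rejects unknown or missing keys so a typo cannot silently fall back
    to a default that the PI never reviewed.
    """
    expected = {f.name for f in fields(Config)}
    unknown = set(raw) - expected
    missing = expected - set(raw)
    if unknown or missing:
        raise ValueError(
            f"config keys mismatch: unknown={sorted(unknown)} missing={sorted(missing)}"
        )
    return Config(**raw)


# The dict stands in for the parsed contents of config.yaml.
cfg = load_config({"activity_window_days": 90, "retention_window_days": 14})
```

Freezing the dataclass means a stage cannot quietly override a threshold after loading, which fits the goal of auditing every decision in one place.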
- Must support: swapping in a new followup survey wave without touching ETL.
- Should support: adding a new outcome variable (e.g., engagement intensity) without restructuring the pipeline.
- Out of scope: running this on a different study with different schemas. If that comes up, we'll fork.
- Out of scope: packaging this for pip install. It's a pipeline, not a library.
Things I'm uncertain about and want reviewer pressure on:
- Is the 14-day retention window the right choice, or should it be a single-day point estimate with a sensitivity check at ±7/±14? The current choice was pragmatic, not principled.
- The 03_sample.py inclusion criteria are still in flux. Reviewers: flag anywhere the inclusion logic is duplicated or implicit (e.g., if 04_analysis.py re-filters instead of trusting the analytic sample).
- Is src/cleaning.py salvageable or should it be split into src/schema_migration.py (Jan 2025 rename) + src/clean.py (genuine shared helpers)?
- I haven't pinned dependencies yet. Is requirements.txt with == pins sufficient, or should I use a lock file (pip-compile, poetry, uv)?