Adding roadmap by gwaybio · Pull Request #708 · cytomining/pycytominer

gwaybio · 2026-06-11T20:37:51Z

Explicitly signaling our pycytominer roadmap

Summary by CodeRabbit

Documentation
- Added a comprehensive project roadmap with timeline and milestones through v2.0, outlining planned API improvements, a fluent pipeline, performance/parallelization goals, and planned deprecations for a cleaner v2 API.
- Described longer-term "Beyond v2" goals (agentic/composable pipeline steps) and included current project state and contribution guidance.

coderabbitai · 2026-06-11T20:38:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 544078ef-5528-4dc1-939b-59beff5b8562

📥 Commits

Reviewing files that changed from the base of the PR and between 1a6c6a6 and 5e7fc4b.

📒 Files selected for processing (1)

ROADMAP.md

✅ Files skipped from review due to trivial changes (1)

ROADMAP.md

📝 Walkthrough

Adds ROADMAP.md containing a project vision, a Mermaid timeline for v1.7→v2+, current v1.6 status, milestone sections for v1.7–v2.0 with checklists, a “Beyond v2” agentic infrastructure note, and a short contributing pointer to CONTRIBUTING.md.

Changes

Project Roadmap

| Layer / File(s) | Summary |
| --- | --- |
| Vision and milestone timeline `ROADMAP.md` | Introduces the roadmap title, project vision, and a Mermaid timeline diagram mapping major milestone phases from v1.7 onwards. |
| Current state and milestone details `ROADMAP.md` | Documents v1.6 current state and detailed milestones: v1.7 API foundations (feature selection, API consistency, docs), v1.8 fluent `CytoDataFrame` pipeline and provenance, v1.9 parallelization and benchmarking, and v2.0 Polars migration plus API cleanup/deprecations. |
| Beyond v2 and contributing guidance `ROADMAP.md` | Describes post-v2 agentic infrastructure as agent-callable pipeline steps and provides a brief contributing section linking to `CONTRIBUTING.md`. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 I hopped through lines to chart the way,
With timelines bright and checklists at play,
From v1.6 now toward skies anew,
I nibbled tasks and left a trail for you,
Come follow this map — the future's in view!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Adding roadmap' directly and clearly describes the main change: adding a roadmap document to the repository. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gwaybio · 2026-06-11T20:41:48Z

cc @axiomcura

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

ROADMAP.md (1)
90-90: ⚡ Quick win

Use hyphen in compound adjective: "High-Performance Backend".

When a compound adjective precedes a noun, it should be hyphenated for clarity. Correct "High Performance Backend" to "High-Performance Backend".
📝 Proposed fix
- ## Milestone 3 — Parallelization and High Performance Backend (v1.9)
+ ## Milestone 3 — Parallelization and High-Performance Backend (v1.9)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ROADMAP.md` at line 90, Update the heading "Milestone 3 — Parallelization and
High Performance Backend (v1.9)" to use a hyphenated compound adjective by
changing "High Performance Backend" to "High-Performance Backend" so the heading
reads "Milestone 3 — Parallelization and High-Performance Backend (v1.9)".

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 33e5a79a-3de5-4f2d-9981-25b0e081dbbd

📥 Commits

Reviewing files that changed from the base of the PR and between edbe8d6 and c35df90.

📒 Files selected for processing (1)

ROADMAP.md

coderabbitai · 2026-06-11T20:43:39Z

+- [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls
+- [ ] Add `OutputConfig` dataclass to consolidate the four repeated output parameters (`output_file`, `output_type`, `compression_options`, `float_format`) shared across all core functions into a single per-call object


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Extract all issue numbers from ROADMAP.md and check their status

# Get all issue numbers referenced in the file
issues=$(rg -o '#\d+' ROADMAP.md | cut -d: -f2 | sort -u)

echo "Checking GitHub issue status..."
for issue in $issues; do
  issue_num=${issue#\#}
  echo "Issue $issue:"
  # Using GitHub CLI to fetch issue status
  gh api repos/cytomining/pycytominer/issues/$issue_num \
    --jq '.state, .title' \
    2>/dev/null || echo " Not found or not accessible"
done

Repository: cytomining/pycytominer

Length of output: 1728

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Lines around 49-70 =="
nl -ba ROADMAP.md | sed -n '35,80p'
echo

echo "== All issue-like tokens (`#digits`) in ROADMAP.md =="
rg -o '#[0-9]+' ROADMAP.md | sort -u
echo

echo "== Context for unexpected issue-like tokens (`#0`, `#1`, `#111111`) =="
for t in '`#0`' '`#1`' '`#111111`' ; do
  num="${t#\#}"
  echo "--- $t ---"
  rg -n "\$t" ROADMAP.md || true
done
echo

echo "== Context for originally cited issues =="
for n in 327 633 656 634 635 660 636 705 646 625 269 704 231 272; do
  echo "--- #$n ---"
  rg -n "#$n" ROADMAP.md || true
done
echo

echo "== Markdown issue links to check (pattern: /issues/<num>) =="
rg -n "github\.com/.*/issues/[0-9]+|cytomining/pycytominer/issues/[0-9]+|/issues/[0-9]+" ROADMAP.md || true

Repository: cytomining/pycytominer

Length of output: 152

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Use the same extraction as the review script, but also print the exact matching lines
echo "== Extracted issues with surrounding context =="
rg -n -o '#[0-9]+' ROADMAP.md | awk -F: '{print $1 ":" $2}' | sort -u > /tmp/issue_tokens.txt

while IFS= read -r line; do
  file_and_lineno="${line%%:*}"
  token="${line##*:}"
  echo "--- $token (from $file_and_lineno) ---"
  # print the line containing the token (may include multiple tokens)
  sed -n "$(echo "$file_and_lineno" | sed 's/.*://')p" ROADMAP.md | cat -n | sed -n '1,1p' || true
done < /tmp/issue_tokens.txt

Repository: cytomining/pycytominer

Length of output: 3126

Fix ROADMAP GitHub issue-link validation guidance

The referenced issue links in ROADMAP.md (#327, #633, #656, #634, #635, #660, #636, #705, #646, #625, #269, #704, #231, #272) resolve to issues and are currently open.

The proposed check script’s rg -o '#\d+' also matches hex colors in Mermaid theme variables (e.g., #111111), causing spurious lookups (e.g., #0, #1, #6, #111111); restrict extraction to actual /issues/<number> links.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@ROADMAP.md` around lines 49-50: The check script currently extracts issue numbers with the fragile pattern rg -o '#\d+', which wrongly matches hex colors in Mermaid/theme blocks; update the extraction to only capture real GitHub issue links (e.g., match '/issues/\d+' or full markdown link patterns like '\[.*?\]\(https?://github.com/[^/]+/[^/]+/issues/\d+\)') instead of '#\d+' so only actual /issues/<number> links (e.g., from ROADMAP.md) are validated; modify the script's occurrence of rg -o '#\d+' to a stricter regex (such as rg -oP '/issues/\K\d+' or rg -oP '\(https?://github.com/[^/]+/[^/]+/issues/\K\d+') and add a quick filter to skip code blocks/mermaid sections if present.
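For illustration, the stricter extraction recommended above can be sketched in Python. This is a minimal sketch, not the script CodeRabbit ran (that used `rg`); the function name `extract_issue_numbers` and the sample text, including the Mermaid theme line, are hypothetical.

```python
import re

# Match only real GitHub issue links, e.g. .../cytomining/pycytominer/issues/327
ISSUE_LINK = re.compile(r"https?://github\.com/[^/\s]+/[^/\s]+/issues/(\d+)")


def extract_issue_numbers(markdown: str) -> list[int]:
    """Return sorted, de-duplicated issue numbers found in /issues/<n> links."""
    return sorted({int(n) for n in ISSUE_LINK.findall(markdown)})


# Hypothetical sample: a Mermaid theme line with a hex color plus two issue links
sample = """\
%%{init: {'themeVariables': {'cScale0': '#111111'}}}%%
- [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327))
- [ ] See also [#704](https://github.com/cytomining/pycytominer/issues/704)
"""

# The fragile pattern also picks up the hex color
naive = sorted(set(re.findall(r"#\d+", sample)))
# The link-based pattern returns only the linked issues
strict = extract_issue_numbers(sample)
```

Here `naive` includes the hex color `#111111` alongside the two issue tokens, while `strict` contains only the numbers taken from `/issues/<number>` links.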

codecov-commenter · 2026-06-11T20:53:10Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.04%. Comparing base (edbe8d6) to head (5e7fc4b).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #708   +/-   ##
=======================================
  Coverage   94.04%   94.04%           
=======================================
  Files          62       62           
  Lines        4504     4504           
=======================================
  Hits         4236     4236           
  Misses        268      268

| Flag | Coverage Δ |
| --- | --- |
| unittests | `94.04% <ø> (ø)` |

d33bs

Exciting! Adding some initial thoughts.

d33bs · 2026-06-11T21:00:59Z

+             : Provenance and logging via chaining
+    section Performance
+        v1.9 : Parallel execution across plates and batches
+             : Pandas to Polars migration inside ProfileData


Polars will likely introduce breaking changes to internals. I suggest moving this to 2.0.

Maybe this is flawed thinking, but we would continue to support the v1 API until dropping it in v2. In other words, the functions would continue working with Polars or pandas through ProfileData. I may not know enough about Polars to be sure this is possible.

d33bs · 2026-06-11T21:06:14Z

+    ProfileData.from_file("profiles.parquet")
+    .aggregate(strata=["Metadata_Well"])
+    .normalize(method="standardize", samples="Metadata_treatment == 'DMSO'")
+    .feature_select(operations=["variance_threshold", "correlation_threshold"])


Consider keeping the same existing api accepting similar input and sending to ProfileData on the backend so the workflow remains recognizable and requires no additional change. Here, there could be a comment made to imply that the data are wrapped / converted using the dataclass.

i don't think i understand this comment. What do you mean "same existing API"? and "sending to ProfileData on the backend"?

Is your suggestion to drop the example chain? I'll add something in the next commit, which may be a middle ground. LMK!
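One way to read the suggestion above is that existing functions keep their call signatures and wrap raw input internally. The following is a minimal sketch under stated assumptions, not pycytominer's API: `ProfileData.wrap`, the `Table` stand-in for a DataFrame, the toy `feature_select` body, and the `Metadata_` prefix rule are all hypothetical.

```python
from dataclasses import dataclass, field

# Stand-in for a DataFrame in this sketch: column name -> list of values
Table = dict[str, list]


@dataclass(frozen=True)
class ProfileData:
    """Hypothetical wrapper: a table plus its resolved column lists."""

    data: Table
    features: list[str] = field(default_factory=list)
    metadata: list[str] = field(default_factory=list)

    @classmethod
    def wrap(cls, data):
        """Accept a raw table or an existing ProfileData; resolve columns once."""
        if isinstance(data, cls):
            return data
        # Assumption: metadata columns are identified by a "Metadata_" prefix
        metadata = [c for c in data if c.startswith("Metadata_")]
        features = [c for c in data if not c.startswith("Metadata_")]
        return cls(data=data, features=features, metadata=metadata)


def feature_select(profiles, drop=()):
    """Existing-style function: same call shape, wrapping happens internally."""
    wrapped = ProfileData.wrap(profiles)
    kept = [f for f in wrapped.features if f not in drop]
    data = {c: wrapped.data[c] for c in wrapped.metadata + kept}
    return ProfileData(data=data, features=kept, metadata=wrapped.metadata)


raw = {
    "Metadata_Well": ["A01", "A02"],
    "Cells_Area": [1.0, 2.0],
    "Nuclei_Area": [3.0, 4.0],
}
result = feature_select(raw, drop=["Nuclei_Area"])  # raw input still works
again = feature_select(result)  # wrapped input works too
```

In this reading, callers who pass raw tables see no change in how they call the function, and the wrapper only becomes visible in the return value.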

d33bs · 2026-06-11T21:32:03Z

+
+- [ ] Add pipeline methods to `ProfileData` (`.normalize()`, `.feature_select()`, `.aggregate()`, `.annotate()`, `.consensus()`) — each delegates to the existing standalone function and returns a new `ProfileData`
+- [ ] Core functions accept `ProfileData` as input in addition to `str` / `pd.DataFrame` — no breaking changes
+- [ ] Provenance and logging emerge naturally from chaining: for example, callers can compare `.features` before and after `feature_select` to see exactly which features were dropped by which operation


Consider clarifying what is meant by "emerge naturally". I'd suggest keeping things relatively open when it comes to the implementation, which might have been the intent here, but I wasn't sure.

sounds good - yeah, this text is an odd mix of specificity and general thinking. Will revise
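As a concrete reading of the roadmap bullet, comparing feature lists before and after each step could look like the sketch below. The helper name `provenance` and the feature names are hypothetical, and the roadmap does not commit to this implementation.

```python
def provenance(initial, steps):
    """Map each operation name to the features it dropped.

    `steps` is a list of (operation_name, features_after_operation) pairs,
    applied in order starting from `initial`.
    """
    log = {}
    current = initial
    for name, after in steps:
        kept = set(after)
        # Preserve the original column order among the dropped features
        log[name] = [f for f in current if f not in kept]
        current = after
    return log


initial = ["Cells_Area", "Cells_Perimeter", "Nuclei_Area", "Nuclei_Constant"]
steps = [
    ("variance_threshold", ["Cells_Area", "Cells_Perimeter", "Nuclei_Area"]),
    ("correlation_threshold", ["Cells_Area", "Nuclei_Area"]),
]
dropped = provenance(initial, steps)
```

With these inputs, `dropped` attributes `Nuclei_Constant` to `variance_threshold` and `Cells_Perimeter` to `correlation_threshold`.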

d33bs · 2026-06-11T21:32:57Z

+## Current State — v1.6
+
+Pycytominer provides a suite of standalone functions (`aggregate`, `normalize`, `annotate`, `feature_select`, `consensus`) that cover the full image-based profiling pipeline.
+The library supports multiple file formats (CSV, Parquet, AnnData, CytoTable Warehouse), runs on Linux, macOS, and Windows, and is tested across Python 3.10–3.14.


The versioning here for Python feels overspecific; consider dropping it.

d33bs · 2026-06-11T21:43:07Z

+
+### ProfileData
+
+- [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls


How will ProfileData integrate with CytoDataFrame, if at all? If we plan to have both it probably means wrapper-inception (wrappers around wrappers). Imagine: we present Pycytominer.ProfileData and CytoDataFrame at the same time and the question comes up: "why have two data objects for similar goals"?

I vote we integrate them and make a push towards a common image-based profiling data type which can store the column information here and much more. While this might make the immediate path complex at first, it shows a more unified front for Cytomining with existing lessons learned from the project there.

sounds good! so your intuition is to make all the ProfileData changes we discuss here inside CytoDataFrame? I agree it will be more complex, but probably more stable in the long run!

This raises a question, then: why are we creating CytoDataFrame and not simply using AnnData? I think we need a compelling answer to this question.

As a person who has not used CytoDataFrame, one immediate concern with integrating ProfileData directly with CytoDataFrame is dependency weight and compatibility. If CytoDataFrame becomes a hard dependency of Pycytominer, it could narrow Pycytominer’s currently supported Python range and introduce a much larger dependency stack, including notebook, image, and visualization-related packages.

My question is... do we want that? 🤔

d33bs · 2026-06-11T21:48:45Z

+
+### ProfileData
+
+- [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls


Another idea: instead of an upfront cost to the user API and abstractions, could ProfileData live as a decorator pattern to core functions where needed (column filtering cost is likely not high and may change through each core function)? Or is there implied value to users by supplying the dataclass in return? If the latter, maybe we should spell this out briefly here.

my thought is to reduce redundant code, have a unified place for storing methods, enable functionality for all API functions consistently, and to enable chaining. I like the idea of CytoDataFrame handling this - i'll make a few edits in response

d33bs · 2026-06-11T21:51:43Z

+### ProfileData
+
+- [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls
+- [ ] Add `OutputConfig` dataclass to consolidate the four repeated output parameters (`output_file`, `output_type`, `compression_options`, `float_format`) shared across all core functions into a single per-call object


What if we expanded on the decorator pattern through @write_to_file_if_user_specifies_output_details to help with this instead of a public/user-facing dataclass?

how does the decision to use cytodataframe influence your thinking on this expansion?
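To make the decorator alternative concrete, here is a minimal sketch. It is not the implementation of `@write_to_file_if_user_specifies_output_details`; the decorator name `write_output_if_requested`, the toy `normalize` function, the list-of-dicts row format, and the choice to return the written path are all assumptions.

```python
import csv
import functools
from pathlib import Path


def write_output_if_requested(func):
    """Hypothetical decorator: write the result to CSV when output_file is given."""

    @functools.wraps(func)
    def wrapper(*args, output_file=None, **kwargs):
        rows = func(*args, **kwargs)
        if output_file is None:
            return rows  # in-memory use: output handling stays out of the way
        path = Path(output_file)
        with path.open("w", newline="") as handle:
            # Assumes a non-empty list of dicts that share the same keys
            writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
        return path

    return wrapper


@write_output_if_requested
def normalize(rows, feature):
    """Toy core function: mean-center one feature."""
    mean = sum(r[feature] for r in rows) / len(rows)
    return [{**r, feature: r[feature] - mean} for r in rows]


rows = [
    {"Metadata_Well": "A01", "Cells_Area": 1.0},
    {"Metadata_Well": "A02", "Cells_Area": 3.0},
]
in_memory = normalize(rows, "Cells_Area")  # no output_file: returns the rows
```

The core function body never mentions output parameters here, which is the property the decorator approach trades against a user-facing `OutputConfig` object.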

d33bs · 2026-06-11T21:55:33Z

+
+### Polars Migration
+
+- [ ] Swap the DataFrame backend inside `ProfileData` from pandas to Polars — the change is contained within `ProfileData`, isolating all call sites from the migration


If we move from pandas to Polars, it's very likely not possible to isolate the changes to only the dataclass. For example, any DataFrame operations would need an overhaul (the APIs and processing decisions differ quite a bit).

got it! related to a previous comment - i'll adjust

d33bs · 2026-06-11T21:56:03Z

+### Polars Migration
+
+- [ ] Swap the DataFrame backend inside `ProfileData` from pandas to Polars — the change is contained within `ProfileData`, isolating all call sites from the migration
+- [ ] Validate performance improvements across the full pipeline


Consider addressing "how" we validate, getting a little more specific; for ex. is this through continuous benchmarking?

i'm ok keeping this vague for now

d33bs · 2026-06-11T21:57:35Z

+
+- [ ] Swap the DataFrame backend inside `ProfileData` from pandas to Polars — the change is contained within `ProfileData`, isolating all call sites from the migration
+- [ ] Validate performance improvements across the full pipeline
+- [ ] Maintain backward compatibility where possible; document breaking changes clearly


Breaking changes imply we move this work to a major release, if following https://semver.org/ conventions.

yes, this is under v2

gwaybio · 2026-06-12T16:16:10Z

i responded to your comments @d33bs - thanks! I think this will help our thinking quite a bit for the months ahead. Can you take another look? There are some additional discussion items. please also make any changes directly

axiomcura

Looks good, left some comments!

axiomcura · 2026-06-12T16:11:19Z

+
+### ProfileData
+
+- [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls


As a person who has not used CytoDataFrame, one immediate concern with integrating ProfileData directly with CytoDataFrame is dependency weight and compatibility. If CytoDataFrame becomes a hard dependency of Pycytominer, it could narrow Pycytominer’s currently supported Python range and introduce a much larger dependency stack, including notebook, image, and visualization-related packages.

My question is... do we want that? 🤔

axiomcura · 2026-06-12T16:17:23Z

+## Current State — v1.6
+
+Pycytominer provides a suite of standalone functions (`aggregate`, `normalize`, `annotate`, `feature_select`, `consensus`) that cover the full image-based profiling pipeline.
+The library supports multiple file formats (CSV, Parquet, AnnData, CytoTable Warehouse), runs on Linux, macOS, and Windows.


This makes it seem that all file formats work on all OSes, which I'd argue is not true given the `csv.Sniffer` problem we have (#704):
https://github.com/gwaybio/pycytominer/blob/edbe8d6203732f13813c90abe086d5d74471b3d3/pycytominer/cyto_utils/load.py#L394

axiomcura · 2026-06-12T16:21:35Z

+### ProfileData
+
+- [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls
+- [ ] Add `OutputConfig` dataclass to consolidate the four repeated output parameters (`output_file`, `output_type`, `compression_options`, `float_format`) shared across all core functions into a single per-call object


I like this idea! But should this include, or explicitly defer, filename-based output type inference (#624)?
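If inference were included, one possible shape is sketched below. The four field names come from the roadmap text; everything else is an assumption, including the field types, the `resolved_type` method, the suffix mapping, and the rule that an explicit `output_type` wins over the filename.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Illustrative mapping only; the real set of supported output types may differ
_SUFFIX_TO_TYPE = {".csv": "csv", ".parquet": "parquet"}


@dataclass(frozen=True)
class OutputConfig:
    """Hypothetical per-call bundle of the four repeated output parameters."""

    output_file: Optional[str] = None
    output_type: Optional[str] = None
    compression_options: Optional[dict] = None
    float_format: Optional[str] = None

    def resolved_type(self) -> Optional[str]:
        """Explicit output_type wins; otherwise infer from the filename."""
        if self.output_type is not None:
            return self.output_type
        if self.output_file is None:
            return None
        # Walk suffixes right to left so "profiles.csv.gz" resolves to "csv"
        for suffix in reversed(Path(self.output_file).suffixes):
            if suffix in _SUFFIX_TO_TYPE:
                return _SUFFIX_TO_TYPE[suffix]
        raise ValueError(f"Cannot infer output type from {self.output_file!r}")


inferred = OutputConfig(output_file="profiles.parquet").resolved_type()
compressed = OutputConfig(output_file="profiles.csv.gz").resolved_type()
explicit = OutputConfig(output_file="profiles.csv", output_type="parquet").resolved_type()
```

Deferring inference would amount to dropping `resolved_type` and requiring `output_type` to be set explicitly.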

axiomcura · 2026-06-12T16:28:33Z

+
+**Goal:** Make pycytominer fast by enabling parallel execution across the pipeline.
+
+- [ ] Parallel execution of independent pipeline steps across plates, batches, or wells


This may not be a problem in the field, but I feel like we need some sort of scope rule, not just “plates, batches, or wells.”

One concern with the parallelization is that “across plates, batches, or wells” may be too broad without defining the valid scope for each operation. Some steps are safe to parallelize by plate or batch, but others depend on the full dataset or a specific reference population. For example, normalization, variance filtering, and correlation filtering can produce different results depending on whether they are computed globally or within each partition.

Maybe it is not worth mentioning here, but it could be an issue to keep in mind when developing parallelization support.
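The scope concern can be shown with a small standard-library example. This is a toy sketch with made-up values, not pycytominer's `normalize`: it z-scores the same measurements once over the pooled data and once within each plate.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of values using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up feature values for two plates with a large plate-level offset
plates = {
    "plate_1": [1.0, 2.0, 3.0],
    "plate_2": [11.0, 12.0, 13.0],
}

# Global scope: one mean and standard deviation for all wells
pooled = [v for vals in plates.values() for v in vals]
global_z = standardize(pooled)

# Per-plate scope: each partition is standardized independently,
# which is what naive parallelization by plate would compute
per_plate_z = [z for vals in plates.values() for z in standardize(vals)]
```

Per-plate standardization removes the offset between plates, so both plates end up with the same z-scores, while the pooled version keeps plate 1 below zero and plate 2 above it. Neither is wrong in itself, but a parallel implementation would need to state which scope each operation uses.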

add roadmap

c35df90

gwaybio requested a review from d33bs as a code owner June 11, 2026 20:37

specify colors

e5adabc

coderabbitai Bot reviewed Jun 11, 2026

d33bs reviewed Jun 11, 2026

gwaybio added 2 commits June 12, 2026 10:02

pr review

1a6c6a6

more response

5e7fc4b

axiomcura reviewed Jun 12, 2026
