Adding roadmap#708
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 WalkthroughWalkthroughAdds ROADMAP.md containing a project vision, a Mermaid timeline for v1.7→v2+, current v1.6 status, milestone sections for v1.7–v2.0 with checklists, a “Beyond v2” agentic infrastructure note, and a short contributing pointer to CONTRIBUTING.md. ChangesProject Roadmap
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
cc @axiomcura |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
ROADMAP.md (1)
90-90: ⚡ Quick winUse hyphen in compound adjective: "High-Performance Backend".
When a compound adjective precedes a noun, it should be hyphenated for clarity. Correct "High Performance Backend" to "High-Performance Backend".
📝 Proposed fix
- ## Milestone 3 — Parallelization and High Performance Backend (v1.9) + ## Milestone 3 — Parallelization and High-Performance Backend (v1.9)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ROADMAP.md` at line 90, Update the heading "Milestone 3 — Parallelization and High Performance Backend (v1.9)" to use a hyphenated compound adjective by changing "High Performance Backend" to "High-Performance Backend" so the heading reads "Milestone 3 — Parallelization and High-Performance Backend (v1.9)".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ROADMAP.md`:
- Around line 49-50: The check script currently extracts issue numbers with the
fragile pattern rg -o '#\d+' which wrongly matches hex colors in Mermaid/theme
blocks; update the extraction to only capture real GitHub issue links (e.g.,
match '/issues/\d+' or full markdown link patterns like
'\[.*?\]\(https?://github.com/[^/]+/[^/]+/issues/\d+\)') instead of '#\d+' so
only actual /issues/<number> links (e.g., from ROADMAP.md) are validated; modify
the script's occurrence of rg -o '#\d+' to a stricter regex (such as rg -oP
'/issues/\K\d+' or rg -oP '\(https?://github.com/[^/]+/[^/]+/issues/\K\d+') and
add a quick filter to skip code blocks/mermaid sections if present.
---
Nitpick comments:
In `@ROADMAP.md`:
- Line 90: Update the heading "Milestone 3 — Parallelization and High
Performance Backend (v1.9)" to use a hyphenated compound adjective by changing
"High Performance Backend" to "High-Performance Backend" so the heading reads
"Milestone 3 — Parallelization and High-Performance Backend (v1.9)".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| - [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls | ||
| - [ ] Add `OutputConfig` dataclass to consolidate the four repeated output parameters (`output_file`, `output_type`, `compression_options`, `float_format`) shared across all core functions into a single per-call object |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Extract all issue numbers from ROADMAP.md and check their status
# Get all issue numbers referenced in the file
issues=$(rg -o '#\d+' ROADMAP.md | cut -d: -f2 | sort -u)
echo "Checking GitHub issue status..."
for issue in $issues; do
issue_num=${issue#\#}
echo "Issue $issue:"
# Using GitHub CLI to fetch issue status
gh api repos/cytomining/pycytominer/issues/$issue_num \
--jq '.state, .title' \
2>/dev/null || echo " Not found or not accessible"
doneRepository: cytomining/pycytominer
Length of output: 1728
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Lines around 49-70 =="
nl -ba ROADMAP.md | sed -n '35,80p'
echo
echo "== All issue-like tokens (`#digits`) in ROADMAP.md =="
rg -o '#[0-9]+' ROADMAP.md | sort -u
echo
echo "== Context for unexpected issue-like tokens (`#0`, `#1`, `#111111`) =="
for t in '`#0`' '`#1`' '`#111111`' ; do
num="${t#\#}"
echo "--- $t ---"
rg -n "\$t" ROADMAP.md || true
done
echo
echo "== Context for originally cited issues =="
for n in 327 633 656 634 635 660 636 705 646 625 269 704 231 272; do
echo "--- #$n ---"
rg -n "#$n" ROADMAP.md || true
done
echo
echo "== Markdown issue links to check (pattern: /issues/<num>) =="
rg -n "github\.com/.*/issues/[0-9]+|cytomining/pycytominer/issues/[0-9]+|/issues/[0-9]+" ROADMAP.md || trueRepository: cytomining/pycytominer
Length of output: 152
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Use the same extraction as the review script, but also print the exact matching lines
echo "== Extracted issues with surrounding context =="
rg -n -o '#[0-9]+' ROADMAP.md | awk -F: '{print $1 ":" $2}' | sort -u > /tmp/issue_tokens.txt
while IFS= read -r line; do
file_and_lineno="${line%%:*}"
token="${line##*:}"
echo "--- $token (from $file_and_lineno) ---"
# print the line containing the token (may include multiple tokens)
sed -n "$(echo "$file_and_lineno" | sed 's/.*://')p" ROADMAP.md | cat -n | sed -n '1,1p' || true
done < /tmp/issue_tokens.txtRepository: cytomining/pycytominer
Length of output: 3126
FixROADMAP GitHub issue-link validation guidance
- The referenced issue links in
ROADMAP.md(#327,#633,#656,#634,#635,#660,#636,#705,#646,#625,#269,#704,#231,#272) resolve to issues and are currently open. - The proposed check script’s
rg -o '#\d+'also matches hex colors in Mermaid theme variables (e.g.,#111111), causing spurious lookups (e.g.,#0,#1,#6,#111111); restrict extraction to actual/issues/<number>links.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ROADMAP.md` around lines 49 - 50, The check script currently extracts issue
numbers with the fragile pattern rg -o '#\d+' which wrongly matches hex colors
in Mermaid/theme blocks; update the extraction to only capture real GitHub issue
links (e.g., match '/issues/\d+' or full markdown link patterns like
'\[.*?\]\(https?://github.com/[^/]+/[^/]+/issues/\d+\)') instead of '#\d+' so
only actual /issues/<number> links (e.g., from ROADMAP.md) are validated; modify
the script's occurrence of rg -o '#\d+' to a stricter regex (such as rg -oP
'/issues/\K\d+' or rg -oP '\(https?://github.com/[^/]+/[^/]+/issues/\K\d+') and
add a quick filter to skip code blocks/mermaid sections if present.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #708 +/- ##
=======================================
Coverage 94.04% 94.04%
=======================================
Files 62 62
Lines 4504 4504
=======================================
Hits 4236 4236
Misses 268 268
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
d33bs
left a comment
There was a problem hiding this comment.
Exciting! Adding some initial thoughts.
| : Provenance and logging via chaining | ||
| section Performance | ||
| v1.9 : Parallel execution across plates and batches | ||
| : Pandas to Polars migration inside ProfileData |
There was a problem hiding this comment.
Polars will likely create possible breaking changes with internals. I suggest moving this to 2.0 .
There was a problem hiding this comment.
maybe this is flawed thinking, but we would continue to support v1 API until dropping it in v2. In other words, the functions would continue working with polars or pandas, through ProfileData. It's possible i don't know enough about polars to be 100% sure this is possible.
| ProfileData.from_file("profiles.parquet") | ||
| .aggregate(strata=["Metadata_Well"]) | ||
| .normalize(method="standardize", samples="Metadata_treatment == 'DMSO'") | ||
| .feature_select(operations=["variance_threshold", "correlation_threshold"]) |
There was a problem hiding this comment.
Consider keeping the same existing api accepting similar input and sending to ProfileData on the backend so the workflow remains recognizable and requires no additional change. Here, there could be a comment made to imply that the data are wrapped / converted using the dataclass.
There was a problem hiding this comment.
i don't think i understand this comment. What do you mean "same existing API"? and "sending to ProfileData on the backend"?
Is your suggestion to drop the example chain? I'll add something in the next commit, which may be a middle ground. LMK!
|
|
||
| - [ ] Add pipeline methods to `ProfileData` (`.normalize()`, `.feature_select()`, `.aggregate()`, `.annotate()`, `.consensus()`) — each delegates to the existing standalone function and returns a new `ProfileData` | ||
| - [ ] Core functions accept `ProfileData` as input in addition to `str` / `pd.DataFrame` — no breaking changes | ||
| - [ ] Provenance and logging emerge naturally from chaining: for example, callers can compare `.features` before and after `feature_select` to see exactly which features were dropped by which operation |
There was a problem hiding this comment.
Consider better clarifying what is meant by "emerge naturally". I'd suggest keeping things relatively open when it comes to the implementation, which might have been the intent here but wasn't sure.
There was a problem hiding this comment.
sounds good - yeah, this text is an odd mix of specificity and general thinking. Will revise
| ## Current State — v1.6 | ||
|
|
||
| Pycytominer provides a suite of standalone functions (`aggregate`, `normalize`, `annotate`, `feature_select`, `consensus`) that cover the full image-based profiling pipeline. | ||
| The library supports multiple file formats (CSV, Parquet, AnnData, CytoTable Warehouse), runs on Linux, macOS, and Windows, and is tested across Python 3.10–3.14. |
There was a problem hiding this comment.
The versioning here for Python feels overspecific, consider dropping.
|
|
||
| ### ProfileData | ||
|
|
||
| - [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls |
There was a problem hiding this comment.
How will ProfileData integrate with CytoDataFrame, if at all? If we plan to have both it probably means wrapper-inception (wrappers around wrappers). Imagine: we present Pycytominer.ProfileData and CytoDataFrame at the same time and the question comes up: "why have two data objects for similar goals"?
I vote we integrate them and make a push towards a common image-based profiling data type which can store the column information here and much more. While this might make the immediate path complex at first, it shows a more unified front for Cytomining with existing lessons learned from the project there.
There was a problem hiding this comment.
sounds good! so you're intuition is to make all the ProfileData changes we discuss here inside CytoDataFrame? I agree it will be more complex, but probably more stable in the long run!
This begs a question then, why are we creating CytoDataFrame and not simply using AnnData? I think we need to have a compelling answer to this question.
There was a problem hiding this comment.
As a person who has not used CytoDataFrame, one immediate concern with integrating ProfileData directly with CytoDataFrame is dependency weight and compatibility. If CytoDataFrame becomes a hard dependency of Pycytominer, it could narrow Pycytominer’s currently supported Python range and introduce a much larger dependency stack, including notebook, image, and visualization-related packages.
My questions is... do we want that ? 🤔
|
|
||
| ### ProfileData | ||
|
|
||
| - [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls |
There was a problem hiding this comment.
Another idea: instead of an upfront cost to the user API and abstractions, could ProfileData live as a decorator pattern to core functions where needed (column filtering cost is likely not high and may change through each core function)? Or is there implied value to users by supplying the dataclass in return? If the latter, maybe we should spell this out briefly here.
There was a problem hiding this comment.
my thought is to reduce redundant code, have a unified place for storing methods, enable functionality for all API functions consistently, and to enable chaining. I like the idea of CytoDataFrame handling this - i'll make a few edits in response
| ### ProfileData | ||
|
|
||
| - [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls | ||
| - [ ] Add `OutputConfig` dataclass to consolidate the four repeated output parameters (`output_file`, `output_type`, `compression_options`, `float_format`) shared across all core functions into a single per-call object |
There was a problem hiding this comment.
What if we expanded on the decorator pattern through @write_to_file_if_user_specifies_output_details to help with this instead of a public/user-facing dataclass?
There was a problem hiding this comment.
how does the decision to use cytodataframe influence your thinking on this expansion?
|
|
||
| ### Polars Migration | ||
|
|
||
| - [ ] Swap the DataFrame backend inside `ProfileData` from pandas to Polars — the change is contained within `ProfileData`, isolating all call sites from the migration |
There was a problem hiding this comment.
If we move from Pandas to Polars it's very likely not possible to isolate the changes to only the dataclass. For ex. any dataframe operations would need to undergo an overhaul (the API's and processing decisions differ quite a bit).
There was a problem hiding this comment.
got it! related to a previous comment - i'll adjust
| ### Polars Migration | ||
|
|
||
| - [ ] Swap the DataFrame backend inside `ProfileData` from pandas to Polars — the change is contained within `ProfileData`, isolating all call sites from the migration | ||
| - [ ] Validate performance improvements across the full pipeline |
There was a problem hiding this comment.
Consider addressing "how" we validate, getting a little more specific; for ex. is this through continuous benchmarking?
There was a problem hiding this comment.
i'm ok keeping this vague for now
|
|
||
| - [ ] Swap the DataFrame backend inside `ProfileData` from pandas to Polars — the change is contained within `ProfileData`, isolating all call sites from the migration | ||
| - [ ] Validate performance improvements across the full pipeline | ||
| - [ ] Maintain backward compatibility where possible; document breaking changes clearly |
There was a problem hiding this comment.
Breaking changes imply we move this work to a major release, if following https://semver.org/ conventions.
|
i responded to your comments @d33bs - thanks! I think this will help our thinking quite a bit for the months ahead. Can you take another look? There are some additional discussion items. please also make any changes directly |
axiomcura
left a comment
There was a problem hiding this comment.
Looks good, left some comments !
|
|
||
| ### ProfileData | ||
|
|
||
| - [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls |
There was a problem hiding this comment.
As a person who has not used CytoDataFrame, one immediate concern with integrating ProfileData directly with CytoDataFrame is dependency weight and compatibility. If CytoDataFrame becomes a hard dependency of Pycytominer, it could narrow Pycytominer’s currently supported Python range and introduce a much larger dependency stack, including notebook, image, and visualization-related packages.
My questions is... do we want that ? 🤔
| ## Current State — v1.6 | ||
|
|
||
| Pycytominer provides a suite of standalone functions (`aggregate`, `normalize`, `annotate`, `feature_select`, `consensus`) that cover the full image-based profiling pipeline. | ||
| The library supports multiple file formats (CSV, Parquet, AnnData, CytoTable Warehouse), runs on Linux, macOS, and Windows. |
There was a problem hiding this comment.
This make it seem that all file formats work with all OSes, which I'd argue is not true:
https://github.com/gwaybio/pycytominer/blob/edbe8d6203732f13813c90abe086d5d74471b3d3/pycytominer/cyto_utils/load.py#L394
Given the csv.sniffer problem we have. #704
| ### ProfileData | ||
|
|
||
| - [ ] Add `ProfileData` dataclass ([#327](https://github.com/cytomining/pycytominer/issues/327)) to bundle a DataFrame with its resolved feature and metadata column lists so they are computed once, not inferred repeatedly across function calls | ||
| - [ ] Add `OutputConfig` dataclass to consolidate the four repeated output parameters (`output_file`, `output_type`, `compression_options`, `float_format`) shared across all core functions into a single per-call object |
There was a problem hiding this comment.
I like this idea! but do you think that this should probably include or explicitly defer filename-based output type inference. #624
|
|
||
| **Goal:** Make pycytominer fast by enabling parallel execution across the pipeline. | ||
|
|
||
| - [ ] Parallel execution of independent pipeline steps across plates, batches, or wells |
There was a problem hiding this comment.
Maybe this may not be a problem in the field, but I feel like we need some sort of scope rule, not just “plates, batches, or wells.”
One concern with the parallelization is that “across plates, batches, or wells” may be too broad without defining the valid scope for each operation. Some steps are safe to parallelize by plate or batch, but others depend on the full dataset or a specific reference population. For example, normalization, variance filtering, and correlation filtering can produce different results depending on whether they are computed globally or within each partition.
Maybe it is not worth mentioning here, but it could be an issue to keep in mind when developing parallelization support.
Explicitly signaling our pycytominer roadmap
Summary by CodeRabbit