Add ephys v2 pipeline tables + force-ingestion script #14

Open
esutlie wants to merge 129 commits into main from ephys/add-v2-tables

Conversation

esutlie commented Mar 17, 2026

Summary

  • Merges the ephys v2 pipeline from dj_pipeline_ephys (v0.2.0 tag on SWC/aeon_mecha) into the Works pipeline
  • Adds force_ingest_external_sorting.py — a 19-step script to ingest Dario's pre-sorted Kilosort 2.5 data into the pipeline tables

New files (from merge)

  • ephys.py — probe, epoch, chunk, and block tables
  • spike_sorting.py — electrode groups, sorting tasks, spike sorting, curation, SyncedSpikes, UnitMatching
  • spike_sorting_curation.py — manual/official curation workflow
  • utils/ephys_utils.py — helper functions
  • schema/ephys.py — data readers for HarpSync, OnixClock, Bno055

New file (this PR)

  • scripts/force_ingest_external_sorting.py — 3-phase ingestion script:
    • Phase A (steps 1-8): Foundation tables (experiment, probe, epoch, chunks, block)
    • Phase B (steps 9-15): Sorting setup + force-insert KS2.5 spike data
    • Phase C (steps 16-19): Auto-approve curation, SyncedSpikes, UnitMatching
    • Includes spike alignment verification (.bin sample count vs pipeline Clock samples)
    • Includes ground truth validation against Dario's pre-computed HARP timestamps
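
The alignment check mentioned above (.bin sample count vs pipeline Clock samples) can be sketched roughly as follows. This is a minimal illustration, not the code in force_ingest_external_sorting.py: the function names are hypothetical, and it assumes the .bin file is a flat, channel-interleaved array of int16 samples (2 bytes each) and that the Clock sample count has already been fetched from the pipeline.

```python
import os


def bin_sample_count(bin_path, n_channels, bytes_per_sample=2):
    """Samples per channel in a flat, channel-interleaved binary file.

    Assumes int16 data (2 bytes per sample) unless told otherwise.
    """
    size = os.path.getsize(bin_path)
    frame_bytes = n_channels * bytes_per_sample
    if size % frame_bytes != 0:
        raise ValueError(
            f"{bin_path}: size {size} is not a multiple of {frame_bytes} bytes per frame"
        )
    return size // frame_bytes


def check_alignment(bin_path, n_channels, clock_sample_count, max_spike_index=None):
    """Compare the .bin sample count with the pipeline's Clock sample count.

    clock_sample_count is assumed to come from the pipeline tables;
    max_spike_index is the largest sample index in spike_times.npy, if known.
    """
    n_bin = bin_sample_count(bin_path, n_channels)
    result = {
        "bin_samples": n_bin,
        "clock_samples": clock_sample_count,
        "counts_match": n_bin == clock_sample_count,
    }
    if max_spike_index is not None:
        # Every spike index must address a sample that exists in the recording.
        result["spikes_in_range"] = 0 <= max_spike_index < n_bin
    return result
```

If the two counts differ, spike sample indices from the external sorting cannot be mapped one-to-one onto Clock samples, which is presumably why the script verifies this before inserting spike data.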

Modified files (from merge)

  • __init__.py — adds DJ_SUPPORT_FILEPATH_MANAGEMENT + filepath_checksum_size_limit (required by ephys filepath@dj_store columns)
  • pyproject.toml — datajoint pin >=0.14,<2 (was >=0.13.7); adds spike_sorting extras
  • paths.py — adds get_sorting_root_dir()
  • 7 schema files — import path swc.aeon.schema → swc.aeon.schema.streams (both work at the current aeon_api pin; the direct path becomes required in newer versions)

Discussion items

See comments below — posting each as a separate thread for easier discussion.

Test plan

  • Static import verification (all files parse, all module references resolve)
  • Phase 2 testing (after S3 upload + DB deployment) — tracked separately

ttngu207 and others added 30 commits May 21, 2025 14:39
…s_chunk

New ephys pipeline - designed around `ephys_chunk` and `ephys_block`
esutlie and others added 16 commits March 12, 2026 11:02
Replicates UnitMatching.make() logic step-by-step, printing
intermediate results. Also compares at various delta_time thresholds
to distinguish code bug from genuine lack of matching units.
Queries actual LENGTH() of longblob columns to identify candidates
for externalization to blob@dj_store.
Checks elissas_aeon_ephys_test_ tables when v2 test tables
don't exist, so we can measure SortedSpikes and SyncedSpikes.
Switch longblob → blob@dj_store for columns averaging >100 KB:
- SortedSpikes.Unit: spike_indices, spike_sites, spike_depths
- SyncedSpikes.Unit: spike_times
- UnitMatching.Spikes: spike_times

Data stored as files on ceph (/ceph/aeon/aeon/dj_store), DB keeps
only a 16-byte UUID reference. Reduces DB size significantly for
units with ~200K+ spikes per entry.
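
A back-of-the-envelope sketch of the size argument in this commit message. It assumes 8 bytes per stored value (float64 or uint64) and reads ">100 KB" as 100,000 bytes; both are assumptions, and the real serialized blob size will differ because of DataJoint's blob header and any compression. The function names are hypothetical.

```python
UUID_BYTES = 16  # size of the reference kept in the DB, per the commit message


def raw_array_bytes(n_spikes, bytes_per_value=8):
    """Raw payload size of one per-spike array, ignoring blob overhead."""
    return n_spikes * bytes_per_value


def should_externalize(avg_bytes, threshold_bytes=100_000):
    """Apply the 'averaging >100 KB' rule; the exact threshold is an assumption."""
    return avg_bytes > threshold_bytes


# A unit with ~200K spikes, as mentioned in the commit message.
payload = raw_array_bytes(200_000)
print(payload, should_externalize(payload), payload // UUID_BYTES)
```

Under these assumptions a 200K-spike array is about 1.6 MB of raw data per column, against 16 bytes in the DB row once the column is moved to blob@dj_store.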
…Centre/tn/ephys_revamp_v2

Refine unit matching design (spec + schema changes)
- Remove one-off debug/diagnostic scripts:
  debug_block01_matching.py, debug_unit_matching.py,
  repair_sync_models.py, measure_blob_sizes.py
- Replace hardcoded DB prefix with config-based safety check
  that blocks production prefix and host
- Add single/multi block modes to run_aeon_spike_sorting.py
…Centre/es/test_ephys_revamp_v2

Fix bugs and improvements from ephys v2 testing
…Centre/es/ephys_revamp

Ephys v2: unit matching, PK revamp, and external blob storage
The Singularity container pip install fix (esutlie/spikeinterface
fix/pip-direct-url-quotes) has been merged upstream as
SpikeInterface/spikeinterface#4438. Switch from fork pin to upstream
main at the merge commit. Can move to a release pin (0.104.0) when
it's available.
…Centre/es/update-spikeinterface-pin

Merging this myself since it's a dependency-only change. The fork fix (Singularity container pip install) was merged upstream today as SpikeInterface/spikeinterface#4438, so this just switches the pin from my fork to the upstream merge commit. No code changes. Moving the v0.2.0 tag forward to include this.
…ript

Brings in the full ephys v2 pipeline (ephys.py, spike_sorting.py,
spike_sorting_curation.py, ephys_utils.py, schema/ephys.py) from
the dj_pipeline_ephys branch (v0.2.0 tag).

Adds force_ingest_external_sorting.py — a 19-step script to ingest
pre-sorted Kilosort 2.5 data into the pipeline tables. Includes
spike alignment verification and ground truth validation against
Dario's pre-computed HARP timestamps.

Other changes from the merge:
- __init__.py: adds DJ_SUPPORT_FILEPATH_MANAGEMENT + filepath checksum config
- pyproject.toml: datajoint pin >=0.14,<2; adds spike_sorting extras
- paths.py: adds get_sorting_root_dir()
- schema imports: swc.aeon.schema -> swc.aeon.schema.streams (forward-compatible)
esutlie commented Mar 17, 2026


Discussion 1: Data provenance transparency

The force-ingestion script leaves several markers that data was sorted externally:

  • execution_duration=0 in PreProcessing, SpikeSorting, PostProcessing
  • Empty File part tables (PreProcessing.File, SpikeSorting.File, PostProcessing.File)
  • "External KS2.5 sorting — data sorted outside the pipeline" note in SortingParamSet params

If the goal is to publish this data publicly through the platform, should we:

  1. Keep the markers for internal traceability (current approach)
  2. Remove them so the data looks indistinguishable from pipeline-processed data
  3. Something else?

esutlie commented Mar 17, 2026


Discussion 2: File part tables for force-ingested data

PreProcessing.File, SpikeSorting.File, and PostProcessing.File are left empty because the pipeline expects SpikeInterface wrapper objects (si_recording.pkl, si_sorting.pkl, sorting_analyzer/) that don't exist for this externally-sorted data.

The raw KS2.5 output files (spike_times.npy, templates.npy, etc.) DO exist on ceph. We could copy them into the expected directory structure and register them in SpikeSorting.File, but they wouldn't be loadable as SI objects.

Options:

  1. Leave empty (current approach) — simplest, but downstream code expecting files will get nothing
  2. Register raw KS files — files exist but aren't SI-compatible
  3. Something else?

esutlie commented Mar 17, 2026


Discussion 3: S3 upload — which files are included?

The force-ingestion script needs the small KS output files (spike_times.npy, spike_clusters.npy, templates.npy, cluster_KSLabel.tsv) but not the large .bin or temp_wh.dat files (TB-scale each).

Which files does the existing upload script include/exclude? In particular, will Dario's spike_index_harp_clock_binary_2_147.npy files be uploaded? We're using those as ground truth to validate that our spike time conversion is correct — they contain HARP-synchronized timestamps that Dario pre-computed independently.
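
To make the question concrete, here is a minimal sketch of the include/exclude rule described above. It is not the existing upload script: the constant names, the classify function, and the example paths are hypothetical, and files that the description does not mention (including the spike_index_harp_clock_binary_2_147.npy files) are deliberately reported as "unspecified" rather than guessed.

```python
from pathlib import PurePosixPath

# Small KS output files the force-ingestion script needs (from the list above).
NEEDED_NAMES = {
    "spike_times.npy",
    "spike_clusters.npy",
    "templates.npy",
    "cluster_KSLabel.tsv",
}
# Large files that are not needed (TB-scale each).
EXCLUDED_NAMES = {"temp_wh.dat"}
EXCLUDED_SUFFIXES = {".bin"}


def classify(path):
    """Return 'needed', 'exclude', or 'unspecified' for one file path."""
    p = PurePosixPath(path)
    if p.name in EXCLUDED_NAMES or p.suffix in EXCLUDED_SUFFIXES:
        return "exclude"
    if p.name in NEEDED_NAMES:
        return "needed"
    return "unspecified"


# Hypothetical paths, for illustration only.
for example in [
    "sorting/spike_times.npy",
    "sorting/temp_wh.dat",
    "sorting/recording.bin",
    "sorting/spike_index_harp_clock_binary_2_147.npy",
]:
    print(example, "->", classify(example))
```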

esutlie requested a review from ttngu207 March 17, 2026 17:52
esutlie added 10 commits March 24, 2026 15:43
Step 15: --skip-alignment flag skips spike alignment verification
Step 17: force-insert SyncedSpikes from Dario's ground truth HARP timestamps
Both triggered by --chunk-export or --skip-alignment CLI flags
Queries aeon-db2 for EphysChunk, SyncModel, and spike sorting table
status. Checks Dario's ground truth files on Ceph. Exports chunk
metadata as JSON via --export flag.
- Change PROBE_TYPE from "neuropixels2.0_beta" to "neuropixels - NP2004"
  to match production DB
- Step 17: use spike_times_sync_binary_2_147.npy (float64 HARP seconds)
  instead of uint64 tick files; skip groups missing sync data gracefully
  (Chs_193_240 has no sync file — needs Thinh/Dario to generate)
- Validation: cross-validate inserted float64 values against independent
  uint64 HARP tick files (250 MHz), converting ticks to seconds
- Upload manifest: add spike_times_sync_binary_2_147.npy as optional file,
  separate ESSENTIAL_FILES from OPTIONAL_FILES
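
The cross-validation described in the last commit can be sketched as below. This is an illustration only, not the script's code: it takes the 250 MHz figure from the commit message as the tick rate, uses plain Python lists instead of the .npy arrays, and the tolerance is a hypothetical default chosen to sit above float64 rounding for large timestamp values.

```python
HARP_TICK_HZ = 250_000_000  # 250 MHz, as stated in the commit message


def ticks_to_seconds(ticks):
    """Convert integer HARP ticks to float seconds."""
    return [t / HARP_TICK_HZ for t in ticks]


def cross_validate(seconds, ticks, tol_s=1e-6):
    """Indices where inserted float64 seconds disagree with tick-derived seconds.

    tol_s is an assumed tolerance, not a value taken from the pipeline.
    Raises if the two sources do not contain the same number of spikes.
    """
    if len(seconds) != len(ticks):
        raise ValueError(f"length mismatch: {len(seconds)} seconds vs {len(ticks)} ticks")
    expected = ticks_to_seconds(ticks)
    return [i for i, (s, e) in enumerate(zip(seconds, expected)) if abs(s - e) > tol_s]


# Toy values: 1.0 s, 2.0 s and 2.5 s expressed in ticks.
ticks = [250_000_000, 500_000_000, 625_000_000]
print(cross_validate([1.0, 2.0, 2.5], ticks))  # no mismatching indices
print(cross_validate([1.0, 2.0, 2.6], ticks))  # index 2 disagrees
```

Because the float64 and uint64 files are independent sources, an empty mismatch list is evidence that the inserted SyncedSpikes values were not corrupted during conversion or ingestion.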

5 participants