Add ephys v2 pipeline tables + force-ingestion script #14
…s_chunk New ephys pipeline - designed around `ephys_chunk` and `ephys_block`
Replicates UnitMatching.make() logic step-by-step, printing intermediate results. Also compares at various delta_time thresholds to distinguish a code bug from a genuine lack of matching units.
Queries actual LENGTH() of longblob columns to identify candidates for externalization to blob@dj_store.
Checks elissas_aeon_ephys_test_ tables when v2 test tables don't exist, so we can measure SortedSpikes and SyncedSpikes.
Switch longblob → blob@dj_store for columns averaging >100 KB:
- SortedSpikes.Unit: spike_indices, spike_sites, spike_depths
- SyncedSpikes.Unit: spike_times
- UnitMatching.Spikes: spike_times

Data stored as files on ceph (/ceph/aeon/aeon/dj_store), DB keeps only a 16-byte UUID reference. Reduces DB size significantly for units with ~200K+ spikes per entry.
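As a rough size check on the threshold above, here is a minimal sketch. It assumes each spike value is stored as an 8-byte number (float64 or int64) and reads "100 KB" as 100 × 1024 bytes. Both are assumptions, the helper names are hypothetical and not part of the pipeline, and real blobs also carry serialization overhead that is ignored here.

```python
# Back-of-the-envelope sizing sketch; all names are hypothetical.
BYTES_PER_SPIKE = 8                 # assumption: float64 / int64 per spike
UUID_BYTES = 16                     # in-DB reference size, per the commit message
EXTERNALIZE_THRESHOLD = 100 * 1024  # assumption: ">100 KB" read as 100 KiB


def blob_bytes(n_spikes: int, bytes_per_spike: int = BYTES_PER_SPIKE) -> int:
    """Approximate raw payload size of one per-unit spike array."""
    return n_spikes * bytes_per_spike


def should_externalize(avg_bytes: float, threshold: int = EXTERNALIZE_THRESHOLD) -> bool:
    """True when a column's average blob size exceeds the threshold."""
    return avg_bytes > threshold


payload = blob_bytes(200_000)  # a unit with ~200K spikes
print(f"raw payload: {payload} bytes, externalize: {should_externalize(payload)}")
print(f"in-DB bytes per entry after the switch: {UUID_BYTES}")
```

Under these assumptions a 200K-spike unit carries roughly 1.6 MB per column, well above the threshold, while the database row keeps only the 16-byte reference.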
…Centre/tn/ephys_revamp_v2 Refine unit matching design (spec + schema changes)
- Remove one-off debug/diagnostic scripts: debug_block01_matching.py, debug_unit_matching.py, repair_sync_models.py, measure_blob_sizes.py
- Replace hardcoded DB prefix with config-based safety check that blocks production prefix and host
- Add single/multi block modes to run_aeon_spike_sorting.py
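The config-based safety check described above might look roughly like the sketch below. The production prefix and host are not stated in this PR, so the two constants are placeholders, and the config keys ("prefix", "host") and function name are hypothetical. It also assumes that matching either the production prefix or the production host is enough to block a run.

```python
# Sketch of a production guard; constants and keys are placeholders, not the real values.
PRODUCTION_PREFIX = "production_prefix_"      # hypothetical placeholder
PRODUCTION_HOST = "production-db.example.org"  # hypothetical placeholder


def check_safe_target(config: dict) -> None:
    """Raise RuntimeError if the configured DB target looks like production."""
    prefix = config.get("prefix", "")
    host = config.get("host", "")
    if prefix == PRODUCTION_PREFIX:
        raise RuntimeError(f"Refusing to run against production prefix {prefix!r}")
    if host == PRODUCTION_HOST:
        raise RuntimeError(f"Refusing to run against production host {host!r}")


# A test-style target passes silently.
check_safe_target({"prefix": "my_test_prefix_", "host": "localhost"})
```

Reading both values from configuration rather than hardcoding a prefix means the same script can target different test schemas while still failing fast on a production target.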
…Centre/es/test_ephys_revamp_v2 Fix bugs and improvements from ephys v2 testing
…Centre/es/ephys_revamp Ephys v2: unit matching, PK revamp, and external blob storage
The Singularity container pip install fix (esutlie/spikeinterface fix/pip-direct-url-quotes) has been merged upstream as SpikeInterface/spikeinterface#4438. Switch from fork pin to upstream main at the merge commit. Can move to a release pin (0.104.0) when it's available.
…Centre/es/update-spikeinterface-pin Merging this myself since it's a dependency-only change. The fork fix (Singularity container pip install) was merged upstream today as SpikeInterface/spikeinterface#4438, so this just switches the pin from my fork to the upstream merge commit. No code changes. Moving the v0.2.0 tag forward to include this.
…ript Brings in the full ephys v2 pipeline (ephys.py, spike_sorting.py, spike_sorting_curation.py, ephys_utils.py, schema/ephys.py) from the dj_pipeline_ephys branch (v0.2.0 tag). Adds force_ingest_external_sorting.py, a 19-step script to ingest pre-sorted Kilosort 2.5 data into the pipeline tables. Includes spike alignment verification and ground truth validation against Dario's pre-computed HARP timestamps.

Other changes from the merge:
- __init__.py: adds DJ_SUPPORT_FILEPATH_MANAGEMENT + filepath checksum config
- pyproject.toml: datajoint pin >=0.14,<2; adds spike_sorting extras
- paths.py: adds get_sorting_root_dir()
- schema imports: swc.aeon.schema -> swc.aeon.schema.streams (forward-compatible)
Discussion 1: Data provenance transparency
The force-ingestion script leaves several markers that data was sorted externally:
If the goal is to publish this data publicly through the platform, should we:
Discussion 2: File part tables for force-ingested data
The raw KS2.5 output files ( Options:
Discussion 3: S3 upload - which files are included?
The force-ingestion script needs the small KS output files ( Which files does the existing upload script include/exclude? In particular, will Dario's
- Step 15: --skip-alignment flag skips spike alignment verification
- Step 17: force-insert SyncedSpikes from Dario's ground truth HARP timestamps

Both triggered by --chunk-export or --skip-alignment CLI flags
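The flag behaviour described above can be sketched with `argparse`. Only the flag names `--skip-alignment` and `--chunk-export` come from the commit message. The function names, the idea that `--chunk-export` takes a file path, and the returned plan dictionary are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of the two CLI flags; everything except the flag names is assumed."""
    parser = argparse.ArgumentParser(description="Force-ingestion flag sketch")
    parser.add_argument("--skip-alignment", action="store_true",
                        help="skip spike alignment verification (step 15)")
    parser.add_argument("--chunk-export", default=None, metavar="PATH",
                        help="assumed: path to exported chunk metadata")
    return parser


def plan_steps(args: argparse.Namespace) -> dict:
    """Either flag skips step 15 and enables the forced insert in step 17."""
    forced = args.skip_alignment or args.chunk_export is not None
    return {
        "run_alignment_check": not forced,      # step 15
        "force_insert_synced_spikes": forced,   # step 17
    }


# Parse an explicit argument list so the sketch does not depend on sys.argv.
example_args = build_parser().parse_args(["--skip-alignment"])
print(plan_steps(example_args))
```

Deriving both step decisions from one `forced` value keeps steps 15 and 17 consistent, so the alignment check is never skipped without the ground-truth insert taking its place.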
Queries aeon-db2 for EphysChunk, SyncModel, and spike sorting table status. Checks Dario's ground truth files on Ceph. Exports chunk metadata as JSON via --export flag.
- Change PROBE_TYPE from "neuropixels2.0_beta" to "neuropixels - NP2004" to match production DB
- Step 17: use spike_times_sync_binary_2_147.npy (float64 HARP seconds) instead of uint64 tick files; skip groups missing sync data gracefully (Chs_193_240 has no sync file; needs Thinh/Dario to generate)
- Validation: cross-validate inserted float64 values against independent uint64 HARP tick files (250 MHz), converting ticks to seconds
- Upload manifest: add spike_times_sync_binary_2_147.npy as optional file, separate ESSENTIAL_FILES from OPTIONAL_FILES
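The tick-to-seconds cross-validation described above can be sketched as follows. The 250 MHz rate is taken from the commit message. The function names, the 1 µs tolerance, and the use of plain Python lists instead of the real .npy arrays are assumptions for illustration only.

```python
# Cross-validation sketch; names and tolerance are hypothetical.
HARP_TICK_HZ = 250_000_000  # 250 MHz, per the commit message


def ticks_to_seconds(ticks):
    """Convert integer tick counts to seconds."""
    return [t / HARP_TICK_HZ for t in ticks]


def find_mismatches(seconds, ticks, tol_s=1e-6):
    """Return indices where float seconds and converted ticks differ by more than tol_s."""
    if len(seconds) != len(ticks):
        raise ValueError("seconds and ticks must have the same length")
    expected = ticks_to_seconds(ticks)
    return [i for i, (s, e) in enumerate(zip(seconds, expected)) if abs(s - e) > tol_s]


ticks = [250_000_000, 500_000_000, 625_000_000]
print(find_mismatches([1.0, 2.0, 2.5], ticks))  # no disagreement expected
print(find_mismatches([1.0, 2.0, 2.6], ticks))  # last entry disagrees
```

Returning the mismatching indices, rather than a single pass/fail flag, makes it easier to see whether a disagreement is isolated or systematic, such as a constant offset.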
Summary
- Merges `dj_pipeline_ephys` (v0.2.0 tag on SWC/aeon_mecha) into the Works pipeline
- Adds `force_ingest_external_sorting.py`, a 19-step script to ingest Dario's pre-sorted Kilosort 2.5 data into the pipeline tables

New files (from merge)
- `ephys.py`: probe, epoch, chunk, and block tables
- `spike_sorting.py`: electrode groups, sorting tasks, spike sorting, curation, SyncedSpikes, UnitMatching
- `spike_sorting_curation.py`: manual/official curation workflow
- `utils/ephys_utils.py`: helper functions
- `schema/ephys.py`: data readers for HarpSync, OnixClock, Bno055

New file (this PR)
- `scripts/force_ingest_external_sorting.py`: 3-phase ingestion script:

Modified files (from merge)
- `__init__.py`: adds `DJ_SUPPORT_FILEPATH_MANAGEMENT` + `filepath_checksum_size_limit` (required by ephys `filepath@dj_store` columns)
- `pyproject.toml`: datajoint pin `>=0.14,<2` (was `>=0.13.7`); adds `spike_sorting` extras
- `paths.py`: adds `get_sorting_root_dir()`
- Schema imports: `swc.aeon.schema` → `swc.aeon.schema.streams` (both work at current aeon_api pin; the direct path becomes required in newer versions)

Discussion items
See comments below — posting each as a separate thread for easier discussion.
Test plan