Add ephys v2 pipeline tables + force-ingestion script by esutlie · Pull Request #14 · dj-sciops/swc-ucl_aeon-social

esutlie · 2026-03-17T17:32:17Z

Summary

Merges the ephys v2 pipeline from dj_pipeline_ephys (v0.2.0 tag on SWC/aeon_mecha) into the Works pipeline
Adds force_ingest_external_sorting.py — a 19-step script to ingest Dario's pre-sorted Kilosort 2.5 data into the pipeline tables

New files (from merge)

ephys.py — probe, epoch, chunk, and block tables
spike_sorting.py — electrode groups, sorting tasks, spike sorting, curation, SyncedSpikes, UnitMatching
spike_sorting_curation.py — manual/official curation workflow
utils/ephys_utils.py — helper functions
schema/ephys.py — data readers for HarpSync, OnixClock, Bno055

New file (this PR)

scripts/force_ingest_external_sorting.py — 3-phase ingestion script:
- Phase A (steps 1-8): Foundation tables (experiment, probe, epoch, chunks, block)
- Phase B (steps 9-15): Sorting setup + force-insert KS2.5 spike data
- Phase C (steps 16-19): Auto-approve curation, SyncedSpikes, UnitMatching
- Includes spike alignment verification (.bin sample count vs pipeline Clock samples)
- Includes ground truth validation against Dario's pre-computed HARP timestamps
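
The alignment check listed above compares the number of samples in the raw .bin file with the number of Clock samples known to the pipeline. A minimal sketch of that comparison follows. It assumes the .bin file holds int16 samples stored channel-interleaved, and that the caller already knows the channel count and the Clock sample count. The function names are illustrative and are not taken from force_ingest_external_sorting.py.

```python
import os

BYTES_PER_SAMPLE = 2  # assumption: raw samples are stored as int16


def bin_sample_count(bin_path, n_channels):
    """Return the number of time samples in a flat, channel-interleaved .bin file."""
    size = os.path.getsize(bin_path)
    frame_bytes = n_channels * BYTES_PER_SAMPLE
    if size % frame_bytes:
        raise ValueError(
            f"{bin_path}: size {size} is not a multiple of {frame_bytes} bytes per frame"
        )
    return size // frame_bytes


def check_alignment(bin_path, n_channels, n_clock_samples):
    """Compare the .bin sample count with the pipeline's Clock sample count."""
    n_bin = bin_sample_count(bin_path, n_channels)
    return {
        "bin_samples": n_bin,
        "clock_samples": n_clock_samples,
        "aligned": n_bin == n_clock_samples,
    }
```

Only the file size is read, so the check stays cheap even for TB-scale recordings.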

Modified files (from merge)

__init__.py — adds DJ_SUPPORT_FILEPATH_MANAGEMENT + filepath_checksum_size_limit (required by ephys filepath@dj_store columns)
pyproject.toml — datajoint pin >=0.14,<2 (was >=0.13.7); adds spike_sorting extras
paths.py — adds get_sorting_root_dir()
7 schema files — import path swc.aeon.schema → swc.aeon.schema.streams (both work at current aeon_api pin; the direct path becomes required in newer versions)
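
This PR switches the schema files to the direct import path. For code that has to run against both older and newer aeon_api pins, a fallback import is one alternative. The helper below is a hypothetical sketch of that idea and is not part of this PR.

```python
import importlib


def import_first_available(module_names):
    """Import and return the first module in module_names that can be imported."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidate modules could be imported: " + "; ".join(errors)
    )


# Possible usage in a schema file (module paths from this PR, preferred path first):
# streams = import_first_available(["swc.aeon.schema.streams", "swc.aeon.schema"])
```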

Discussion items

See comments below — posting each as a separate thread for easier discussion.

Test plan

Static import verification (all files parse, all module references resolve)
Phase 2 testing (after S3 upload + DB deployment) — tracked separately

Switch longblob → blob@dj_store for columns averaging >100 KB:
- SortedSpikes.Unit: spike_indices, spike_sites, spike_depths
- SyncedSpikes.Unit: spike_times
- UnitMatching.Spikes: spike_times

Data stored as files on ceph (/ceph/aeon/aeon/dj_store), DB keeps only a 16-byte UUID reference. Reduces DB size significantly for units with ~200K+ spikes per entry.
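
A rough sketch of how the ">100 KB" rule could be applied once average blob sizes have been measured. The query builder uses MySQL's LENGTH(), which returns the byte length of a blob value. The helper names, the KiB reading of "100 KB", and the 8-byte-per-spike estimate are assumptions made for illustration.

```python
THRESHOLD_BYTES = 100 * 1024  # ">100 KB" from the commit message; KiB is an assumption


def avg_length_query(table, column):
    """Build a MySQL query for the average stored byte length of a blob column.

    `table` is expected to be an already-quoted name such as `schema`.`table`.
    """
    return f"SELECT AVG(LENGTH(`{column}`)) AS avg_bytes FROM {table}"


def columns_to_externalize(avg_bytes_by_column, threshold=THRESHOLD_BYTES):
    """Return the columns whose average size exceeds the threshold, sorted by name."""
    return sorted(
        col
        for col, avg in avg_bytes_by_column.items()
        if avg is not None and avg > threshold  # AVG() is NULL for an empty table
    )


# Rough size of one spike_times entry, assuming 8-byte values:
# 200,000 spikes x 8 bytes = 1.6 MB before serialization overhead,
# compared with a 16-byte UUID reference kept in the DB.
approx_entry_bytes = 200_000 * 8
```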

- Remove one-off debug/diagnostic scripts: debug_block01_matching.py, debug_unit_matching.py, repair_sync_models.py, measure_blob_sizes.py
- Replace hardcoded DB prefix with config-based safety check that blocks production prefix and host
- Add single/multi block modes to run_aeon_spike_sorting.py
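
The config-based safety check mentioned above could look roughly like the sketch below. The function name, the exception class, and the way blocked values are passed in are all hypothetical. The actual check in the scripts may read its values from the DataJoint config in a different way.

```python
class UnsafeTargetError(RuntimeError):
    """Raised when a test script is pointed at a production database."""


def check_safe_target(host, prefix, blocked_hosts, blocked_prefixes):
    """Raise UnsafeTargetError if host or prefix matches a blocked production value."""
    if host.strip().lower() in {h.lower() for h in blocked_hosts}:
        raise UnsafeTargetError(f"refusing to run against production host {host!r}")
    if prefix in blocked_prefixes:
        raise UnsafeTargetError(f"refusing to run with production prefix {prefix!r}")
    return True
```

Calling the check once at script start, before any table is declared or populated, keeps a misconfigured run from touching production tables.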

The Singularity container pip install fix (esutlie/spikeinterface fix/pip-direct-url-quotes) has been merged upstream as SpikeInterface/spikeinterface#4438. Switch from fork pin to upstream main at the merge commit. Can move to a release pin (0.104.0) when it's available.

…Centre/es/update-spikeinterface-pin Merging this myself since it's a dependency-only change. The fork fix (Singularity container pip install) was merged upstream today as SpikeInterface/spikeinterface#4438, so this just switches the pin from my fork to the upstream merge commit. No code changes. Moving the v0.2.0 tag forward to include this.

…ript Brings in the full ephys v2 pipeline (ephys.py, spike_sorting.py, spike_sorting_curation.py, ephys_utils.py, schema/ephys.py) from the dj_pipeline_ephys branch (v0.2.0 tag). Adds force_ingest_external_sorting.py, a 19-step script to ingest pre-sorted Kilosort 2.5 data into the pipeline tables. Includes spike alignment verification and ground truth validation against Dario's pre-computed HARP timestamps.

Other changes from the merge:
- __init__.py: adds DJ_SUPPORT_FILEPATH_MANAGEMENT + filepath checksum config
- pyproject.toml: datajoint pin >=0.14,<2; adds spike_sorting extras
- paths.py: adds get_sorting_root_dir()
- schema imports: swc.aeon.schema -> swc.aeon.schema.streams (forward-compatible)

esutlie · 2026-03-17T17:40:53Z

Discussion 1: Data provenance transparency

The force-ingestion script leaves several markers that data was sorted externally:

execution_duration=0 in PreProcessing, SpikeSorting, PostProcessing
Empty File part tables (PreProcessing.File, SpikeSorting.File, PostProcessing.File)
"External KS2.5 sorting — data sorted outside the pipeline" note in SortingParamSet params

If the goal is to publish this data publicly through the platform, should we:

Keep the markers for internal traceability (current approach)
Remove them so the data looks indistinguishable from pipeline-processed data
Something else?
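
If the markers are kept, they can double as a simple provenance test. The sketch below reports which of the three markers are present for one sorting entry. The inputs are plain dictionaries standing in for values fetched from the tables, and the function names are hypothetical.

```python
EXTERNAL_NOTE = "External KS2.5 sorting"  # start of the note quoted above


def provenance_markers(durations, file_counts, paramset_note):
    """Report which of the three external-sorting markers are present.

    durations:     table name -> execution_duration
    file_counts:   File part table name -> number of rows
    paramset_note: note text from the SortingParamSet params, or None
    """
    return {
        "zero_duration": all(d == 0 for d in durations.values()),
        "empty_file_tables": all(n == 0 for n in file_counts.values()),
        "paramset_note": EXTERNAL_NOTE in (paramset_note or ""),
    }


def is_externally_sorted(durations, file_counts, paramset_note):
    """True only when all three markers agree."""
    return all(provenance_markers(durations, file_counts, paramset_note).values())
```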

esutlie · 2026-03-17T17:43:48Z

Discussion 2: File part tables for force-ingested data

PreProcessing.File, SpikeSorting.File, and PostProcessing.File are left empty because the pipeline expects SpikeInterface wrapper objects (si_recording.pkl, si_sorting.pkl, sorting_analyzer/) that don't exist for this externally-sorted data.

The raw KS2.5 output files (spike_times.npy, templates.npy, etc.) DO exist on ceph. We could copy them into the expected directory structure and register them in SpikeSorting.File, but they wouldn't be loadable as SI objects.

Options:

Leave empty (current approach) — simplest, but downstream code expecting files will get nothing
Register raw KS files — files exist but aren't SI-compatible
Something else?

esutlie · 2026-03-17T17:48:42Z

Discussion 3: S3 upload — which files are included?

The force-ingestion script needs the small KS output files (spike_times.npy, spike_clusters.npy, templates.npy, cluster_KSLabel.tsv) but not the large .bin or temp_wh.dat files (TB-scale each).

Which files does the existing upload script include/exclude? In particular, will Dario's spike_index_harp_clock_binary_2_147.npy files be uploaded? We're using those as ground truth to validate that our spike time conversion is correct — they contain HARP-synchronized timestamps that Dario pre-computed independently.
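
To make the question concrete, here is a sketch of a file filter built from the names in this comment. It is not the existing upload script. Files that are neither essential nor explicitly excluded, such as the ground truth .npy files, land in an "undecided" group, which mirrors the open question above.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Small Kilosort outputs the force-ingestion script needs (names from the comment above).
ESSENTIAL_FILES = (
    "spike_times.npy",
    "spike_clusters.npy",
    "templates.npy",
    "cluster_KSLabel.tsv",
)
# Large raw or intermediate files that should stay out of the upload.
EXCLUDED_PATTERNS = ("*.bin", "temp_wh.dat")


def classify_files(paths):
    """Split paths into essential, excluded, and undecided groups by file name."""
    groups = {"essential": [], "excluded": [], "undecided": []}
    for path in paths:
        name = PurePosixPath(path).name
        if name in ESSENTIAL_FILES:
            groups["essential"].append(path)
        elif any(fnmatchcase(name, pattern) for pattern in EXCLUDED_PATTERNS):
            groups["excluded"].append(path)
        else:
            groups["undecided"].append(path)
    return groups


def missing_essentials(paths):
    """Return the essential file names that do not appear anywhere in paths."""
    present = {PurePosixPath(path).name for path in paths}
    return [name for name in ESSENTIAL_FILES if name not in present]
```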

- Change PROBE_TYPE from "neuropixels2.0_beta" to "neuropixels - NP2004" to match production DB
- Step 17: use spike_times_sync_binary_2_147.npy (float64 HARP seconds) instead of uint64 tick files; skip groups missing sync data gracefully (Chs_193_240 has no sync file; needs Thinh/Dario to generate)
- Validation: cross-validate inserted float64 values against independent uint64 HARP tick files (250 MHz), converting ticks to seconds
- Upload manifest: add spike_times_sync_binary_2_147.npy as optional file, separate ESSENTIAL_FILES from OPTIONAL_FILES
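
The cross-validation step described above amounts to dividing the uint64 tick counts by the 250 MHz tick rate and comparing the result with the float64 seconds. A minimal sketch, using plain Python lists in place of the arrays loaded from the .npy files; the function names and the default tolerance are assumptions, not values from the script.

```python
HARP_TICK_HZ = 250_000_000  # 250 MHz, as stated in the commit message


def ticks_to_seconds(ticks):
    """Convert integer HARP ticks to float seconds."""
    return [t / HARP_TICK_HZ for t in ticks]


def max_abs_difference(seconds, ticks):
    """Largest absolute difference between the float seconds and converted ticks."""
    if len(seconds) != len(ticks):
        raise ValueError(f"length mismatch: {len(seconds)} seconds vs {len(ticks)} ticks")
    return max(
        (abs(s - t / HARP_TICK_HZ) for s, t in zip(seconds, ticks)),
        default=0.0,
    )


def cross_validate(seconds, ticks, tol=1e-6):
    """True when every pair agrees within tol seconds (tol is a hypothetical default)."""
    return max_abs_difference(seconds, ticks) <= tol
```

The tolerance should not be tighter than float64 resolution at the timestamp magnitudes involved. For large absolute HARP times that resolution can be coarser than a single 4 ns tick.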

ttngu207 and others added 30 commits May 21, 2025 14:39

feat: initial ephys pipeline design

50806e1

chore: fix missing .py

70f1d61

chore: updated design - add ElectrodeSelection

7f91902

chore: update design - part 3

41a1f88

chore: table name change

96b822a

chore: add more details, update docstring

4642e34

feat: new schema revision

669b6e8

Merge branch 'datajoint_pipeline' into dev_ephys_chunk

bc8db75

feat: implemented make for spike sorting

7c20ba1

feat: spikesorting with singularity

9baab90

feat(ephys): generate probe using ProbeInterface

4d8ce6a

feat: add ingestion for units, minor improvements

6f885fe

chore: minor bugfix

d1643dc

feat: complete ephys' make simplify QC Metrics

2cb0f8b

chore: code clean up

025491a

feat: ephys default viz with spikeinterface report

a19f467

Update ephys_mock_ingestion.py

aafcec2

Merge branch 'datajoint_pipeline' into dev_ephys_chunk

3121f4e

feat: added EphysChunk & SyncModel ingestion

02561db

feat: implement SyncedSpikes

267b525

feat: add script to run spike sorting

b2dffa6

Merge pull request SainsburyWellcomeCentre#492 from ttngu207/dev_ephy…

f4e965b

…s_chunk New ephys pipeline - designed around `ephys_chunk` and `ephys_block`

autoclear error job in run_spike_sorting script

838636a

chore: week-long spike sorting

62b9f97

Initialise conda

1422a0e

chore: week-long spike sorting

8de1df6

fix(ephys): convert chunk start/end to datetime

558a8de

fix(SyncedSpikes): account for no spikes in chunk

9b3ea6e

Add simple resource monitor script

e8bc797

Add profiler to slurm script

b70984b

esutlie and others added 16 commits March 12, 2026 11:02

Gantt chart: give single-block units a minimum bar width

a0b575d

Gantt chart: bars span full block width edge-to-edge

181ce65

Add targeted diagnostic for block 0↔1 matching failure

5d4ec09

Replicates UnitMatching.make() logic step-by-step, printing intermediate results. Also compares at various delta_time thresholds to distinguish code bug from genuine lack of matching units.

Fix: get_matching() returns Series, not dict — iterate values directly

574d671

Add temporal offset measurement to block matching diagnostic

dbf18a2

Add chunk-level spike time comparison to distinguish sync vs sorting …

0a0046f

…issue

Add diagnostic script to measure blob sizes in ephys tables

92f03b7

Queries actual LENGTH() of longblob columns to identify candidates for externalization to blob@dj_store.

Add fallback to old test schema for blob size measurement

8db61e9

Checks elissas_aeon_ephys_test_ tables when v2 test tables don't exist, so we can measure SortedSpikes and SyncedSpikes.

Merge pull request SainsburyWellcomeCentre#534 from SainsburyWellcome…

d55cecd

…Centre/tn/ephys_revamp_v2 Refine unit matching design (spec + schema changes)

Merge pull request SainsburyWellcomeCentre#538 from SainsburyWellcome…

5331e7c

…Centre/es/test_ephys_revamp_v2 Fix bugs and improvements from ephys v2 testing

Merge pull request SainsburyWellcomeCentre#539 from SainsburyWellcome…

356de6e

…Centre/es/ephys_revamp Ephys v2: unit matching, PK revamp, and external blob storage

esutlie requested a review from ttngu207 March 17, 2026 17:52

esutlie added 10 commits March 24, 2026 15:43

Add Axon upload script for ephys sorting data (S3)

41d6597

Add CHUNK_EXPORT_PATH and SKIP_ALIGNMENT config variables

fa48618

Step 6: support importing EphysChunks from production export JSON

3da0a4e

Step 15: add --skip-alignment flag for Works deployment

9f50dfa

Fix validation dtype handling for datetime64 spike_times

d7212b3

Steps 15,17: support Works deployment without raw data

25e4355

Step 15: --skip-alignment flag skips spike alignment verification Step 17: force-insert SyncedSpikes from Dario's ground truth HARP timestamps Both triggered by --chunk-export or --skip-alignment CLI flags
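
A sketch of how the two flags could be wired with argparse so that either one switches steps 15 and 17 into the no-raw-data mode. That --chunk-export takes a JSON path is an assumption based on the "production export JSON" commit. The function names and help texts are illustrative.

```python
import argparse


def build_parser():
    """Build a parser exposing the two flags named in the commit message."""
    parser = argparse.ArgumentParser(
        description="Force-ingest externally sorted data (sketch)"
    )
    parser.add_argument(
        "--chunk-export",
        metavar="JSON_PATH",
        default=None,
        help="import EphysChunks from an exported JSON file (assumed to take a path)",
    )
    parser.add_argument(
        "--skip-alignment",
        action="store_true",
        help="skip the spike alignment verification in step 15",
    )
    return parser


def resolve_modes(args):
    """Either flag implies that raw data is unavailable for steps 15 and 17."""
    no_raw_data = args.skip_alignment or args.chunk_export is not None
    return {
        "skip_alignment_check": no_raw_data,             # step 15
        "synced_spikes_from_ground_truth": no_raw_data,  # step 17
    }
```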

Fix partial-failure resumption in step 6 and remove dead variable

b17f425

Add exploration script for production ephys data

c4274b0

Queries aeon-db2 for EphysChunk, SyncModel, and spike sorting table status. Checks Dario's ground truth files on Ceph. Exports chunk metadata as JSON via --export flag.

Add investigation script for HARP ticks, probe_type, and Chs_193_240

22f0272

esutlie wants to merge 129 commits into main from ephys/add-v2-tables

5 participants
