Code and data accompanying the manuscript egusphere-2025-6418: a machine-learning classifier that assigns water samples to phytoplankton clusters using only in-situ optical signals from a custom multichannel fluorimeter (the WETLabs 3X1M / "ECO" sensor, excitation at 440 / 470 / 532 nm), validated against HPLC pigments, flow cytometry, and carbon measurements.
The work is built on the BOUSSOLE time-series station (Mediterranean Sea); cruises
are numbered b224–b236. Controlled laboratory monoculture experiments (Lab/)
provide the reference taxa used in the correspondence analysis (Figure 3).
This is a data-analysis project, not an application: there is no build step. Work
happens by running R scripts (data processing + figures) and Jupyter notebooks (the ML
model). Run everything from the repository root — scripts resolve paths relative to
it (R via here::here(), notebooks via repo-relative paths).
Boussole/         Main dataset and analysis (BOUSSOLE station)
Data/raw/         Instrument outputs: echo/ (3X1M .txt), SBEMOOSE/ (CTD .cnv),
                  HPLC/ (.xlsx), cyto/ (flow cytometry), carbon/ (POC/PIC)
Data/labo/        Lab monoculture results used by Figure 3
Output/Data/      Per-cruise merged CSVs; Output/Data/Compiled/ holds the joined
                  datasets that feed the figures and the ML model
Output/Plots/     Generated figures (incl. fig*_recomputed.* publication exports)
Scripts/          R processing pipeline + Python ML notebooks (see below)
Lab/              Controlled lab experiments with monocultures
Data_convert/     Standalone converters for raw ECO logs (Fortran + shell + R)
doc/spectre_exc/  Excitation/emission spectra plotting
R (tested with R 4.3.1):
Rscript install_dependencies.R

Python (tested with Python 3.14):
python -m venv .venv
# Windows: .venv\Scripts\activate
# Linux / macOS: source .venv/bin/activate
pip install -r requirements.txt

The processing is sequential and CSV-based: each stage reads the previous stage's
output from Boussole/Output/Data/Compiled/. Order matters:
1. open_cast.R: the core merge. Parses one CTD .cnv + one 3X1M echo .txt, aligns them
   in time, splits ascending/descending casts, writes
   Output/Data/bNNN/bNNN_{asc,desc}N.csv. Loops over all cruises at the bottom. Watch
   for cruise-specific branching keyed on the bouss number (e.g. b231 has a different
   CTD column config; bouss < 230 parses a different echo layout and an extra ECOV2
   sensor; this reflects a real mid-campaign sensor reconfiguration).
2. ctd_echo_full_table.R: bins per-cruise CSVs to 1 m depth and concatenates into
   ctd_multiplexer_all_campains.csv.
3. add_hplc.R / add cyto.R / add_carbon.R / add_cp.R: successively left-join HPLC
   pigments, cytometry, carbon, and beam attenuation (cp), producing the chain
   ctd_echo_hplc.csv → ctd_echo_hplc_cyto.csv → ctd_echo_hplc_cyto_carbon.csv.
   (read_hplc_bouss.R defines the tolerant read_hplc() parser, source()d by several
   scripts.)
4. fluorescence_qc.R: quality control of the multispectral fluorescence: dark-count
   subtraction, rolling-median outlier rejection, smoothing, depth interpolation, band
   ratios (f470_f440, f532_f440, …) and backscatter (bbp). Produces the clustered
   datasets and ultimately ml_dataset.csv.
5. hplc_cluster.Rmd: derives the phytoplankton cluster labels from HPLC pigments (the
   supervised target), via a two-stage correspondence-analysis + hierarchical
   clustering procedure.
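
The 1 m binning and successive left joins described above can be pictured with a small, dependency-free Python sketch. It is not the code in ctd_echo_full_table.R or the add_*.R scripts: the column names (cruise, depth, f440, tchla), the floor-based bin convention, and the join key (cruise, depth_bin) are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
import math


def bin_to_1m(rows, value_keys):
    """Average each value column within 1 m depth bins, per cruise.

    Assumption: a sample at depth d falls in bin floor(d); the real
    script may centre or round its bins differently.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["cruise"], math.floor(row["depth"]))].append(row)
    binned = []
    for (cruise, depth_bin), members in sorted(groups.items()):
        out = {"cruise": cruise, "depth_bin": depth_bin}
        for key in value_keys:
            out[key] = mean(m[key] for m in members)
        binned.append(out)
    return binned


def left_join(left, right, keys, right_cols):
    """Keep every left row; attach right columns where the key matches, else None."""
    index = {tuple(r[k] for k in keys): r for r in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        extra = {c: (match[c] if match else None) for c in right_cols}
        joined.append({**row, **extra})
    return joined


# Hypothetical profile samples (cruise, depth in m, one fluorescence channel).
profile = [
    {"cruise": "b231", "depth": 10.2, "f440": 100.0},
    {"cruise": "b231", "depth": 10.8, "f440": 110.0},
    {"cruise": "b231", "depth": 11.4, "f440": 90.0},
]
# Hypothetical discrete HPLC sample taken in the 10 m bin only.
hplc = [{"cruise": "b231", "depth_bin": 10, "tchla": 0.35}]

binned = bin_to_1m(profile, ["f440"])
merged = left_join(binned, hplc, ["cruise", "depth_bin"], ["tchla"])
```

In this sketch the 11 m bin survives the join with tchla set to None, which is the behaviour a left join gives for depth bins that have no matching discrete sample.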
The physical calibration constants in fluorescence_qc.R (dark-count offsets 49/50/52,
per-channel chl conversions, backscatter coefficients ki = 1.076,
scale_backscattering = 1.906e-6) are instrument calibrations, not arbitrary values;
do not change them without understanding the instrument.
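
To make the role of the dark-count offsets concrete, here is a minimal Python sketch of the first QC steps (dark-count subtraction, rolling-median outlier rejection, one band ratio). It is an illustration under stated assumptions, not a port of fluorescence_qc.R: the mapping of the offsets 49/50/52 to the 440/470/532 channels, the window of 5 samples, and the rejection threshold of 50 counts are all hypothetical.

```python
from statistics import median

# Dark-count offsets quoted above; which offset belongs to which
# channel is an assumption made for this sketch.
DARK_COUNTS = {"f440": 49, "f470": 50, "f532": 52}


def subtract_dark(counts, channel):
    """Remove the instrument dark signal from raw counts (floored at zero)."""
    return [max(c - DARK_COUNTS[channel], 0) for c in counts]


def rolling_median(values, window=5):
    """Centred rolling median; the window shrinks at the profile edges."""
    half = window // 2
    return [
        median(values[max(i - half, 0): i + half + 1])
        for i in range(len(values))
    ]


def reject_outliers(values, window=5, max_dev=50):
    """Replace points further than max_dev from the rolling median with None.

    Both window and max_dev are illustrative values, not the thresholds
    used in fluorescence_qc.R.
    """
    med = rolling_median(values, window)
    return [v if abs(v - m) <= max_dev else None for v, m in zip(values, med)]


def band_ratio(numerator, denominator):
    """Element-wise ratio, e.g. f470_f440; None where either side is unusable."""
    return [
        n / d if n is not None and d not in (None, 0) else None
        for n, d in zip(numerator, denominator)
    ]


# Hypothetical raw counts for one short profile, with a spike in the 470 channel.
raw_440 = [149, 151, 153, 149, 147]
raw_470 = [250, 252, 900, 248, 250]

f440 = reject_outliers(subtract_dark(raw_440, "f440"))
f470 = reject_outliers(subtract_dark(raw_470, "f470"))
f470_f440 = band_ratio(f470, f440)
```

Because the offsets are subtracted before any ratio is formed, changing them shifts every downstream band ratio, which is one reason to treat them as fixed instrument properties.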
Python notebooks in Boussole/Scripts/ consume Output/Data/Compiled/ml_dataset.csv
(features: band ratios f440_f470, f532_f470, plus bb700/cp; target: cluster),
using scikit-learn + imbalanced-learn:
- svc_bouss.ipynb: the main in-situ classifier, a GradientBoostingClassifier with
  GridSearchCV, SMOTE oversampling, and feature-set ablation (full / no-cp / no-532 /
  mf-only …) scored by balanced accuracy and weighted recall.
- svc_lab.ipynb: SVC (RBF) on the lab monoculture data, validated with
  LeaveOneGroupOut / GroupKFold by culture.
- prediction.ipynb / pred_cluster.ipynb: apply a trained model to continuous CTD
  profiles.
- novelty.ipynb: OneClassSVM novelty detection.
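
The two scores named above follow the usual definitions: balanced accuracy is the unweighted mean of per-class recall, and weighted recall is per-class recall weighted by class size. The dependency-free Python sketch below spells this out; the notebooks themselves rely on scikit-learn, and the labels here are hypothetical.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted samples / true samples."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}, support


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall: every cluster counts equally."""
    recall, _ = per_class_recall(y_true, y_pred)
    return sum(recall.values()) / len(recall)


def weighted_recall(y_true, y_pred):
    """Per-class recall weighted by class size (equals overall accuracy)."""
    recall, support = per_class_recall(y_true, y_pred)
    total = sum(support.values())
    return sum(recall[label] * support[label] for label in recall) / total


# Hypothetical, deliberately imbalanced cluster labels.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]

bal = balanced_accuracy(y_true, y_pred)   # 0.75
wrec = weighted_recall(y_true, y_pred)    # 0.9
```

Weighted recall is dominated by the large cluster (0.9 here), while balanced accuracy exposes the weak recall on the small one (0.75). Reporting both is informative when cluster sizes are uneven, the same situation that oversampling such as SMOTE is meant to address.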
The compute_fig* scripts regenerate the publication figures at 600 dpi into
Boussole/Output/Plots/ (fig{2,3,4,5,6}_recomputed.*). Run from the repo root:
Rscript Boussole/Scripts/compute_fig2.R # cluster vertical distribution + treemaps
Rscript Boussole/Scripts/compute_fig3.R # correspondence analysis (lab + field clusters)
Rscript Boussole/Scripts/compute_fig4.R # per-cluster precision / recall across configs
python Boussole/Scripts/compute_fig5.py # descriptor importance (MDI ± SD over CV folds)
Rscript Boussole/Scripts/compute_fig6.R # mean weighted recall per config (2/3/4 clusters)

Figures 4 and 6 use ggpubr::stat_compare_means(); with more than two groups this
defaults to a Kruskal-Wallis omnibus test (the significance stars in the figures).
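
For readers unfamiliar with the omnibus test, the following Python sketch computes the Kruskal-Wallis H statistic by hand. It is a simplified illustration, not what ggpubr runs internally: it omits the tie-correction factor, uses the asymptotic chi-squared approximation, and the three groups of scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic on pooled ranks, without the tie-correction factor."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)


# Hypothetical recall scores for three model configurations.
configs = [[0.61, 0.64, 0.66], [0.70, 0.72, 0.74], [0.80, 0.83, 0.85]]
h_stat = kruskal_wallis_h(configs)

# With three groups, H is compared to a chi-squared distribution with
# 2 degrees of freedom, whose upper-tail probability is exp(-H / 2).
# This closed form holds only for 2 degrees of freedom.
p_value = math.exp(-h_stat / 2)
```

The test only says that at least one group differs from the others; it does not identify which pair, so an omnibus star is not a pairwise comparison between two configurations.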
Compiled/processed CSVs (under Boussole/Output/Data/) and the raw instrument inputs
(Boussole/Data/raw/) are included so the pipeline can be reproduced end-to-end. HPLC,
cytometry, and carbon measurements originate from the BOUSSOLE programme.
Released under CC BY 4.0 — see LICENSE. If you use this work, please
cite the associated manuscript (egusphere-2025-6418).