Multispectral Fluorescence for Phytoplankton Community Discrimination

Code and data accompanying the manuscript egusphere-2025-6418: a machine-learning classifier that assigns water samples to phytoplankton clusters using only in-situ optical signals from a custom multichannel fluorimeter (the WETLabs 3X1M / "ECO" sensor, excitation at 440 / 470 / 532 nm), validated against HPLC pigments, flow cytometry, and carbon measurements.

The work is built on the BOUSSOLE time-series station (Mediterranean Sea); cruises are numbered b224–b236. Controlled laboratory monoculture experiments (Lab/) provide the reference taxa used in the correspondence analysis (Figure 3).

This is a data-analysis project, not an application: there is no build step. Work happens by running R scripts (data processing + figures) and Jupyter notebooks (the ML model). Run everything from the repository root — scripts resolve paths relative to it (R via here::here(), notebooks via repo-relative paths).
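For notebooks, one way to get paths that do not depend on the current working directory is to locate the repository root explicitly. The sketch below is only an illustration: `find_repo_root` and `compiled_path` are hypothetical helpers that are not part of this repository, and using `mf.Rproj` as the root marker is an assumption.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "mf.Rproj") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found at or above {start}")


def compiled_path(root: Path, name: str) -> Path:
    """Build the path of a file under Boussole/Output/Data/Compiled/."""
    return root / "Boussole" / "Output" / "Data" / "Compiled" / name


# Typical use inside a notebook (hypothetical):
#   root = find_repo_root(Path.cwd())
#   ml_csv = compiled_path(root, "ml_dataset.csv")
```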

Repository layout

Boussole/            Main dataset and analysis (BOUSSOLE station)
  Data/raw/          Instrument outputs: echo/ (3X1M .txt), SBEMOOSE/ (CTD .cnv),
                     HPLC/ (.xlsx), cyto/ (flow cytometry), carbon/ (POC/PIC)
  Data/labo/         Lab monoculture results used by Figure 3
  Output/Data/       Per-cruise merged CSVs; Output/Data/Compiled/ holds the joined
                     datasets that feed the figures and the ML model
  Output/Plots/      Generated figures (incl. fig*_recomputed.* publication exports)
  Scripts/           R processing pipeline + Python ML notebooks (see below)
Lab/                 Controlled lab experiments with monocultures
Data_convert/        Standalone converters for raw ECO logs (Fortran + shell + R)
doc/spectre_exc/     Excitation/emission spectra plotting

Setup

R (tested with R 4.3.1):

Rscript install_dependencies.R

Python (tested with Python 3.14):

python -m venv .venv
# Windows:        .venv\Scripts\activate
# Linux / macOS:  source .venv/bin/activate
pip install -r requirements.txt

The Boussole data pipeline

The processing is sequential and CSV-based — each stage reads the previous stage's output from Boussole/Output/Data/Compiled/. Order matters:

1. open_cast.R: the core merge. Parses one CTD .cnv and one 3X1M echo .txt, aligns them in time, splits ascending/descending casts, and writes Output/Data/bNNN/bNNN_{asc,desc}N.csv. The loop over all cruises is at the bottom of the script. Watch for cruise-specific branching keyed on the bouss number: for example, b231 has a different CTD column config, and cruises with bouss < 230 parse a different echo layout and an extra ECOV2 sensor (this reflects a real mid-campaign sensor reconfiguration).
2. ctd_echo_full_table.R: bins per-cruise CSVs to 1 m depth and concatenates them into ctd_multiplexer_all_campains.csv.
3. add_hplc.R / add cyto.R / add_carbon.R / add_cp.R: successively left-join HPLC pigments, cytometry, carbon, and beam attenuation (cp), producing the chain ctd_echo_hplc.csv → ctd_echo_hplc_cyto.csv → ctd_echo_hplc_cyto_carbon.csv. (read_hplc_bouss.R defines the tolerant read_hplc() parser, source()d by several scripts.)
4. fluorescence_qc.R: quality control of the multispectral fluorescence, covering dark-count subtraction, rolling-median outlier rejection, smoothing, depth interpolation, band ratios (f470_f440, f532_f440, …) and backscatter (bbp). Produces the clustered datasets and ultimately ml_dataset.csv.
5. hplc_cluster.Rmd: derives the phytoplankton cluster labels from HPLC pigments (the supervised target), via a two-stage correspondence-analysis + hierarchical clustering procedure.
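The first three stages (cast splitting, 1 m binning, left-joins) can be illustrated in a few lines of pandas. This is a sketch on synthetic data, not a port of the R scripts: the split at the deepest sample, the floor-based binning, the join keys (bouss, depth), and the tchla column are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic down-then-up profile for one hypothetical cruise.
cast = pd.DataFrame({
    "bouss": 224,
    "depth": [0.4, 1.2, 1.8, 2.6, 2.1, 1.3, 0.6],
    "f440": [10.0, 12.0, 14.0, 16.0, 15.0, 13.0, 11.0],
})

# Split at the deepest sample (assumed rule): rows up to and including it
# form the descending cast, the remaining rows the ascending cast.
deepest = cast["depth"].idxmax()
desc = cast.loc[:deepest].copy()
asc = cast.loc[deepest + 1:].copy()


def bin_to_1m(df: pd.DataFrame) -> pd.DataFrame:
    """Average all samples falling in the same 1 m depth bin (floor-based)."""
    out = df.copy()
    out["depth"] = np.floor(out["depth"]).astype(int)
    return out.groupby(["bouss", "depth"], as_index=False).mean()


# Discrete HPLC sample (hypothetical column name and value).
hplc = pd.DataFrame({"bouss": [224], "depth": [1], "tchla": [0.35]})

# A left join keeps every optical bin; bins without an HPLC sample get NaN.
merged = bin_to_1m(desc).merge(hplc, on=["bouss", "depth"], how="left")
print(merged)
```

The left join is what lets the continuous optical profile survive intact while the sparser discrete measurements attach only where a matching sample exists.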

The physical calibration constants in fluorescence_qc.R (dark-count offsets 49/50/52, per-channel chl conversions, backscatter coefficients ki = 1.076, scale_backscattering = 1.906e-6) are instrument calibrations, not arbitrary values — do not change them without understanding the instrument.
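To show where those constants enter, here is a minimal Python sketch of the first QC steps on synthetic counts. It is not the code in fluorescence_qc.R: the assignment of the 49/50/52 offsets to the 440/470/532 channels is an assumed ordering, and the outlier rule (distance from a centred rolling median, measured in rolling median absolute deviations) is one common choice whose window and threshold are illustrative.

```python
import pandas as pd

# Dark-count offsets from the README; the channel order is an assumption.
DARK = {"f440": 49, "f470": 50, "f532": 52}

raw = pd.DataFrame({
    "depth": [1, 2, 3, 4, 5, 6, 7],
    "f440": [149, 151, 150, 900, 152, 150, 149],  # 900 is an injected spike
    "f470": [100, 101, 100, 100, 102, 101, 100],
    "f532": [72, 73, 72, 72, 74, 73, 72],
})


def reject_outliers(series: pd.Series, window: int = 5, n_mad: float = 5.0) -> pd.Series:
    """Replace points far from a centred rolling median with NaN."""
    median = series.rolling(window, center=True, min_periods=1).median()
    resid = (series - median).abs()
    mad = resid.rolling(window, center=True, min_periods=1).median()
    return series.where(resid <= n_mad * mad + 1e-9)


qc = raw.copy()
for channel, dark in DARK.items():
    qc[channel] = qc[channel] - dark                           # dark-count subtraction
    qc[channel] = reject_outliers(qc[channel]).interpolate()   # mask spikes, fill gaps

# Band ratios, named as in the pipeline description.
qc["f470_f440"] = qc["f470"] / qc["f440"]
qc["f532_f440"] = qc["f532"] / qc["f440"]
print(qc)
```

Subtracting the dark counts before forming ratios matters: a ratio of raw counts would be biased towards 1 at low signal, which is one reason the offsets should not be edited casually.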

The ML model

Python notebooks in Boussole/Scripts/ consume Output/Data/Compiled/ml_dataset.csv (features: band ratios f440_f470, f532_f470, plus bb700/cp; target: cluster), using scikit-learn + imbalanced-learn:

svc_bouss.ipynb — the main in-situ classifier: GradientBoostingClassifier with GridSearchCV, SMOTE oversampling, and feature-set ablation (full / no-cp / no-532 / mf-only …) scored by balanced accuracy and weighted recall.
svc_lab.ipynb — SVC (RBF) on the lab monoculture data, validated with LeaveOneGroupOut / GroupKFold by culture.
prediction.ipynb / pred_cluster.ipynb — apply a trained model to continuous CTD profiles.
novelty.ipynb — OneClassSVM novelty detection.
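The feature-set ablation in the main classifier can be sketched with scikit-learn alone. The example below uses synthetic data in place of ml_dataset.csv, a deliberately tiny hyperparameter grid, and an assumed composition for each feature set; the SMOTE step is left out to keep the dependencies minimal. If oversampling is added, it should run inside each cross-validation fold (for example through an imbalanced-learn pipeline) so that synthetic samples never leak into the validation split.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

FEATURES = ["f440_f470", "f532_f470", "bb700", "cp"]

# Synthetic stand-in for ml_dataset.csv: 3 well-separated clusters.
rng = np.random.default_rng(0)
n = 40
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(n, len(FEATURES)))
               for c in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], n)

# Assumed definitions of three of the ablation configurations.
FEATURE_SETS = {
    "full": FEATURES,
    "no-cp": ["f440_f470", "f532_f470", "bb700"],
    "no-532": ["f440_f470", "bb700", "cp"],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}  # illustrative grid

scores = {}
for name, cols in FEATURE_SETS.items():
    idx = [FEATURES.index(c) for c in cols]
    search = GridSearchCV(
        GradientBoostingClassifier(random_state=0),
        param_grid,
        scoring="balanced_accuracy",
        cv=cv,
    )
    search.fit(X[:, idx], y)
    scores[name] = search.best_score_

for name, score in scores.items():
    print(f"{name:8s} balanced accuracy = {score:.3f}")
```

Balanced accuracy averages per-class recall, so a rare cluster weighs as much as a common one. Note that `best_score_` is selected on the same folds it is computed on and is therefore slightly optimistic.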

Reproducing the manuscript figures

The compute_fig* scripts regenerate the publication figures at 600 dpi into Boussole/Output/Plots/ (fig{2,3,4,5,6}_recomputed.*). Run from the repo root:

Rscript Boussole/Scripts/compute_fig2.R   # cluster vertical distribution + treemaps
Rscript Boussole/Scripts/compute_fig3.R   # correspondence analysis (lab + field clusters)
Rscript Boussole/Scripts/compute_fig4.R   # per-cluster precision / recall across configs
python  Boussole/Scripts/compute_fig5.py  # descriptor importance (MDI ± SD over CV folds)
Rscript Boussole/Scripts/compute_fig6.R   # mean weighted recall per config (2/3/4 clusters)

Figures 4 and 6 use ggpubr::stat_compare_means(). With more than two groups it defaults to a Kruskal-Wallis omnibus test, which is the test behind the significance stars in those figures.
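For readers working on the Python side, the same omnibus test is available as scipy.stats.kruskal. The sketch below runs it on made-up recall values for three hypothetical configurations; the star cutpoints are assumed to follow ggpubr's documented defaults, and whether the figures use exactly that labelling is an assumption.

```python
from scipy.stats import kruskal

# Made-up weighted-recall values for three hypothetical configurations.
recall_by_config = {
    "config_a": [0.59, 0.60, 0.61, 0.62, 0.63],
    "config_b": [0.69, 0.70, 0.71, 0.72, 0.73],
    "config_c": [0.79, 0.80, 0.81, 0.82, 0.83],
}

# One omnibus test across all groups: "do any of the groups differ?"
stat, p_value = kruskal(*recall_by_config.values())


def stars(p: float) -> str:
    """Map a p-value to significance stars (assumed ggpubr default cutpoints)."""
    for cut, symbol in [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")]:
        if p <= cut:
            return symbol
    return "ns"


print(f"H = {stat:.2f}, p = {p_value:.4f}, label = {stars(p_value)}")
```

An omnibus result says only that at least one group differs; it does not identify which pairs differ, which would require post-hoc pairwise tests.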

Data availability

Compiled/processed CSVs (under Boussole/Output/Data/) and the raw instrument inputs (Boussole/Data/raw/) are included so the pipeline can be reproduced end-to-end. HPLC, cytometry, and carbon measurements originate from the BOUSSOLE programme.

License & citation

Released under CC BY 4.0 — see LICENSE. If you use this work, please cite the associated manuscript (egusphere-2025-6418).
