Flavi1P/Multispectral-Fluorescence

Multispectral Fluorescence for Phytoplankton Community Discrimination

Code and data accompanying the manuscript egusphere-2025-6418: a machine-learning classifier that assigns water samples to phytoplankton clusters using only in-situ optical signals from a custom multichannel fluorimeter (the WETLabs 3X1M / "ECO" sensor, excitation at 440 / 470 / 532 nm), validated against HPLC pigments, flow cytometry, and carbon measurements.

The work is built on the BOUSSOLE time-series station (Mediterranean Sea); cruises are numbered b224 to b236. Controlled laboratory monoculture experiments (Lab/) provide the reference taxa used in the correspondence analysis (Figure 3).

This is a data-analysis project, not an application: there is no build step. Work happens by running R scripts (data processing + figures) and Jupyter notebooks (the ML model). Run everything from the repository root — scripts resolve paths relative to it (R via here::here(), notebooks via repo-relative paths).
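For notebook code, repo-relative paths can be made a little more robust by locating the repository root explicitly. The sketch below is only an illustration, not the notebooks' actual code: `find_repo_root` and `compiled_csv` are hypothetical helpers, and using the `Boussole` directory as the root marker is an assumption.

```python
from pathlib import Path


def find_repo_root(start=".", marker="Boussole"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no parent of {start} contains a '{marker}' directory")


def compiled_csv(root, name="ml_dataset.csv"):
    """Build the path to a file under Boussole/Output/Data/Compiled/."""
    return Path(root) / "Boussole" / "Output" / "Data" / "Compiled" / name
```

Calling `compiled_csv(find_repo_root())` from the repository root or from Boussole/Scripts/ would then give the same path in both cases.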

Repository layout

Boussole/            Main dataset and analysis (BOUSSOLE station)
  Data/raw/          Instrument outputs: echo/ (3X1M .txt), SBEMOOSE/ (CTD .cnv),
                     HPLC/ (.xlsx), cyto/ (flow cytometry), carbon/ (POC/PIC)
  Data/labo/         Lab monoculture results used by Figure 3
  Output/Data/       Per-cruise merged CSVs; Output/Data/Compiled/ holds the joined
                     datasets that feed the figures and the ML model
  Output/Plots/      Generated figures (incl. fig*_recomputed.* publication exports)
  Scripts/           R processing pipeline + Python ML notebooks (see below)
Lab/                 Controlled lab experiments with monocultures
Data_convert/        Standalone converters for raw ECO logs (Fortran + shell + R)
doc/spectre_exc/     Excitation/emission spectra plotting

Setup

R (tested with R 4.3.1):

Rscript install_dependencies.R

Python (tested with Python 3.14):

python -m venv .venv
# Windows:        .venv\Scripts\activate
# Linux / macOS:  source .venv/bin/activate
pip install -r requirements.txt

The Boussole data pipeline

The processing is sequential and CSV-based — each stage reads the previous stage's output from Boussole/Output/Data/Compiled/. Order matters:

  1. open_cast.R — the core merge. Parses one CTD .cnv + one 3X1M echo .txt, aligns them in time, splits ascending/descending casts, writes Output/Data/bNNN/bNNN_{asc,desc}N.csv. Loops over all cruises at the bottom. Watch for cruise-specific branching keyed on the bouss number (e.g. b231 has a different CTD column config; bouss < 230 parses a different echo layout and an extra ECOV2 sensor — this reflects a real mid-campaign sensor reconfiguration).
  2. ctd_echo_full_table.R — bins per-cruise CSVs to 1 m depth and concatenates into ctd_multiplexer_all_campains.csv.
  3. add_hplc.R / add cyto.R / add_carbon.R / add_cp.R: successively left-join HPLC pigments, cytometry, carbon, and beam attenuation (cp), producing the chain ctd_echo_hplc.csv → ctd_echo_hplc_cyto.csv → ctd_echo_hplc_cyto_carbon.csv. (read_hplc_bouss.R defines the tolerant read_hplc() parser, source()d by several scripts.)
  4. fluorescence_qc.R — quality control of the multispectral fluorescence: dark-count subtraction, rolling-median outlier rejection, smoothing, depth interpolation, band ratios (f470_f440, f532_f440, …) and backscatter (bbp). Produces the clustered datasets and ultimately ml_dataset.csv.
  5. hplc_cluster.Rmd — derives the phytoplankton cluster labels from HPLC pigments (the supervised target), via a two-stage correspondence-analysis + hierarchical clustering procedure.
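The first three stages boil down to three table operations: splitting a profile into its descending and ascending parts, averaging into 1 m depth bins, and left-joining discrete samples onto the binned profile. The following is a minimal Python sketch of those operations, not a port of the R scripts. It assumes the cast turns around at its deepest sample and that bin n covers depths [n, n + 1); the column names (`cruise`, `depth`, `f440`, `tchla`) and all values are made up for illustration.

```python
import math
from collections import defaultdict
from statistics import mean


def split_cast(rows, depth_col="depth"):
    """Split one profile at its deepest sample into (descending, ascending)."""
    deepest = max(range(len(rows)), key=lambda i: rows[i][depth_col])
    return rows[: deepest + 1], rows[deepest + 1 :]


def bin_to_1m(rows, value_cols, id_col="cruise", depth_col="depth"):
    """Average value_cols within 1 m bins; bin n covers depths [n, n + 1)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[id_col], math.floor(row[depth_col]))].append(row)
    binned = []
    for (cruise, depth_bin), members in sorted(groups.items()):
        out = {id_col: cruise, depth_col: depth_bin}
        for col in value_cols:
            out[col] = mean([m[col] for m in members])
        binned.append(out)
    return binned


def left_join(left, right, keys):
    """Keep every left row; right-hand columns are None where no key matches."""
    right_cols = [c for c in (right[0] if right else {}) if c not in keys]
    index = {tuple(r[k] for k in keys): r for r in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        merged_row = dict(row)
        for col in right_cols:
            merged_row[col] = match[col] if match else None
        joined.append(merged_row)
    return joined


# Toy profile (made-up numbers): the sensor goes down to 1.4 m, then back up.
cast = [
    {"cruise": "b224", "depth": 0.2, "f440": 100.0},
    {"cruise": "b224", "depth": 0.8, "f440": 110.0},
    {"cruise": "b224", "depth": 1.4, "f440": 120.0},
    {"cruise": "b224", "depth": 0.9, "f440": 112.0},
    {"cruise": "b224", "depth": 0.3, "f440": 101.0},
]
hplc = [{"cruise": "b224", "depth": 0, "tchla": 0.31}]  # hypothetical pigment column

desc, asc = split_cast(cast)
binned = bin_to_1m(desc, value_cols=["f440"])
merged = left_join(binned, hplc, keys=["cruise", "depth"])
print(merged)
```

Because the join is a left join, depth bins without a matching discrete sample are kept, with the pigment column left empty. This is why the chained CSVs keep the full set of optical rows at every stage.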

The physical calibration constants in fluorescence_qc.R (dark-count offsets 49/50/52, per-channel chl conversions, backscatter coefficients ki = 1.076, scale_backscattering = 1.906e-6) are instrument calibrations, not arbitrary values — do not change them without understanding the instrument.
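To show how the first QC steps fit together, here is a rough Python sketch of dark-count subtraction, rolling-median outlier rejection, and a band ratio. It is not the code in fluorescence_qc.R. The assignment of the offsets 49/50/52 to the 440/470/532 nm channels in that order is an assumption, the window size and rejection threshold are hypothetical parameters, and the raw counts are invented.

```python
from statistics import median

# Dark-count offsets from the README; the channel order is assumed.
DARK_COUNTS = {"f440": 49, "f470": 50, "f532": 52}


def subtract_dark(counts, channel):
    """Remove the instrument dark-count offset from raw counts."""
    return [c - DARK_COUNTS[channel] for c in counts]


def rolling_median(values, window=5):
    """Centred rolling median; the window shrinks at the profile edges."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1]) for i in range(len(values))]


def reject_outliers(values, window=5, max_dev=20.0):
    """Replace points further than max_dev from the rolling median with None."""
    medians = rolling_median(values, window)
    return [v if abs(v - m) <= max_dev else None for v, m in zip(values, medians)]


def band_ratio(numerator, denominator):
    """Element-wise ratio; None where either value is missing or the divisor is 0."""
    return [
        n / d if n is not None and d is not None and d != 0 else None
        for n, d in zip(numerator, denominator)
    ]


raw_440 = [149, 151, 150, 400, 152, 150]  # one spike at index 3
raw_470 = [100, 101, 100, 102, 101, 100]

f440 = reject_outliers(subtract_dark(raw_440, "f440"))
f470 = reject_outliers(subtract_dark(raw_470, "f470"))
f470_f440 = band_ratio(f470, f440)
print(f440)       # the spike is replaced by None
print(f470_f440)
```

Rejected points are left as gaps here; in the pipeline described above, the smoothing and depth-interpolation steps come after this stage.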

The ML model

Python notebooks in Boussole/Scripts/ consume Output/Data/Compiled/ml_dataset.csv (features: band ratios f440_f470, f532_f470, plus bb700/cp; target: cluster), using scikit-learn + imbalanced-learn:

  • svc_bouss.ipynb — the main in-situ classifier: GradientBoostingClassifier with GridSearchCV, SMOTE oversampling, and feature-set ablation (full / no-cp / no-532 / mf-only …) scored by balanced accuracy and weighted recall.
  • svc_lab.ipynb — SVC (RBF) on the lab monoculture data, validated with LeaveOneGroupOut / GroupKFold by culture.
  • prediction.ipynb / pred_cluster.ipynb — apply a trained model to continuous CTD profiles.
  • novelty.ipynb: OneClassSVM novelty detection.
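The two scores and the grouped validation scheme named above can be written out in a few lines. The sketch below uses plain Python to show the definitions; the notebooks themselves rely on scikit-learn, and the labels and culture names here are invented. Balanced accuracy is the unweighted mean of per-class recall, while support-weighted recall works out to the same number as overall accuracy.

```python
from collections import Counter, defaultdict


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in support}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall, so rare clusters count equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def weighted_recall(y_true, y_pred):
    """Per-class recall weighted by class support."""
    support = Counter(y_true)
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls[c] * support[c] for c in support) / len(y_true)


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one split per group."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for held_out in sorted(by_group):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, by_group[held_out]


# Made-up cluster labels: cluster 1 is twice as common as cluster 2.
y_true = [1, 1, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 2, 1]
print(balanced_accuracy(y_true, y_pred))  # mean of 3/4 and 1/2
print(weighted_recall(y_true, y_pred))    # 4 correct out of 6

# Hypothetical culture identifiers for the lab samples.
cultures = ["A", "A", "B", "C", "C"]
splits = list(leave_one_group_out(cultures))
```

Holding out a whole culture at a time keeps samples from the same culture out of both the training and the test fold, which is the point of validating by group rather than by row.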

Reproducing the manuscript figures

The compute_fig* scripts regenerate the publication figures at 600 dpi into Boussole/Output/Plots/ (fig{2,3,4,5,6}_recomputed.*). Run from the repo root:

Rscript Boussole/Scripts/compute_fig2.R   # cluster vertical distribution + treemaps
Rscript Boussole/Scripts/compute_fig3.R   # correspondence analysis (lab + field clusters)
Rscript Boussole/Scripts/compute_fig4.R   # per-cluster precision / recall across configs
python  Boussole/Scripts/compute_fig5.py  # descriptor importance (MDI ± SD over CV folds)
Rscript Boussole/Scripts/compute_fig6.R   # mean weighted recall per config (2/3/4 clusters)
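For Figure 5, "MDI ± SD over CV folds" amounts to averaging each descriptor's importance across the cross-validation folds and reporting the spread. The sketch below only illustrates that aggregation and is not taken from compute_fig5.py. The importance values are invented, and using the sample standard deviation (n - 1) is an assumption.

```python
from statistics import mean, stdev


def summarise_importances(fold_importances):
    """Return {feature: (mean, sample SD)} across per-fold importance dicts."""
    features = list(fold_importances[0])
    summary = {}
    for feature in features:
        values = [fold[feature] for fold in fold_importances]
        summary[feature] = (mean(values), stdev(values))
    return summary


# Invented per-fold importances (each fold sums to 1, as MDI importances do).
folds = [
    {"f440_f470": 0.40, "f532_f470": 0.30, "bb700": 0.20, "cp": 0.10},
    {"f440_f470": 0.50, "f532_f470": 0.20, "bb700": 0.20, "cp": 0.10},
    {"f440_f470": 0.45, "f532_f470": 0.25, "bb700": 0.20, "cp": 0.10},
]
summary = summarise_importances(folds)
for feature, (avg, sd) in summary.items():
    print(f"{feature}: {avg:.3f} ± {sd:.3f}")
```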

Figures 4 and 6 use ggpubr::stat_compare_means(); with more than two groups this defaults to a Kruskal-Wallis omnibus test (the significance stars in the figures).
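For readers who want to see what the omnibus test computes, here is a small Python sketch of the Kruskal-Wallis H statistic with the usual tie correction. It is an illustration only (the figures use the R implementation via ggpubr), and the three groups are toy values. H is compared against a chi-square distribution with k - 1 degrees of freedom; for exactly three groups that p-value has the closed form exp(-H / 2), which is the only case handled here.

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks of the pooled values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected H statistic (undefined if every pooled value is identical)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1.0 - ties / (n ** 3 - n))


def p_value_three_groups(h):
    """Chi-square survival function for 2 degrees of freedom (3 groups only)."""
    return math.exp(-h / 2.0)


groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # toy data, fully separated
h_stat = kruskal_wallis_h(groups)
p_val = p_value_three_groups(h_stat)
print(h_stat, p_val)
```

A significant omnibus result only says that at least one group differs; it does not identify which pairs of configurations differ.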

Data availability

Compiled/processed CSVs (under Boussole/Output/Data/) and the raw instrument inputs (Boussole/Data/raw/) are included so the pipeline can be reproduced end-to-end. HPLC, cytometry, and carbon measurements originate from the BOUSSOLE programme.

License & citation

Released under CC BY 4.0 — see LICENSE. If you use this work, please cite the associated manuscript (egusphere-2025-6418).
