Skip to content

Latest commit

 

History

History
170 lines (127 loc) · 6.06 KB

File metadata and controls

170 lines (127 loc) · 6.06 KB

psptools

Open-source R package developed for making shellfish toxicity forecasts using regulatory monitoring data

Setup

To run this package, you will need Tensorflow, Keras and all of their dependencies installed and configured on your machine.

For first time Keras users, follow these steps to setup your machine for running the psptools software:

install.packages("keras")
library(reticulate)
virtualenv_create("r-reticulate", python=install_python())

library(keras)
install_keras(envname = "r-reticulate")

Package Installation

remotes::install_github("BigelowLab/psptools")

Usage

Here we provide a brief demonstration of going from raw input data to a forecast of shellfish toxicity measurements. For full documentation of each step, please refer to psptools-guide.

Input data

Prepare your input data for transforming into model input. The input table must have the columns:

  • The date and year each toxicity measurement was taken (date + year)

  • A unique location identifier where the sample was taken (location_id)

  • The total measured toxicity which will be used to predict (total_toxicity)

  • The species (or genus) of shellfish measured (species)

  • The region the sample is from, although the region may be the same for both train and test sets (region)

  • Individual columns used for model input ie 12 congeners (gtx4, gtx1, etc…)

input_data <- read_psp_data(model_ready = TRUE)

glimpse(input_data)
## Rows: 15,200
## Columns: 20
## $ id             <chr> "12091_2019-07-10_arctica", "12092_2019-07-10_arctica",…
## $ location_id    <chr> "12091", "12092", "12125", "ACINC", "ACINC", "ACInc", "…
## $ species        <chr> "arctica", "arctica", "arctica", "arctica", "arctica", …
## $ total_toxicity <dbl> 31.4491000, 44.8684900, 70.3315700, 0.3268534, 5.949918…
## $ gtx4           <dbl> 0.000000, 8.088602, 17.028635, 0.000000, 0.000000, 0.00…
## $ gtx1           <dbl> 4.831219, 6.642926, 12.681950, 0.000000, 0.000000, 0.00…
## $ dcgtx3         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gtx5           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ dcgtx2         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gtx3           <dbl> 18.3904384, 22.9927635, 28.5759119, 0.3268534, 4.390941…
## $ gtx2           <dbl> 4.982137, 7.144196, 8.679571, 0.000000, 0.000000, 0.000…
## $ neo            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ dcstx          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ stx            <dbl> 3.245307, 0.000000, 3.365504, 0.000000, 0.000000, 0.000…
## $ c1             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ c2             <dbl> 0.000000, 0.000000, 0.000000, 0.000000, 1.558977, 0.000…
## $ date           <date> 2019-07-10, 2019-07-10, 2019-07-10, 2018-04-27, 2019-0…
## $ year           <chr> "2019", "2019", "2019", "2018", "2019", "2019", "2019",…
## $ region         <chr> "maine", "maine", "maine", "maine", "maine", "maine", "…
## $ gap_days       <dbl> 0, 0, 0, 0, 0, 0, 2, 62, 0, 0, 14, 18, 0, 0, 11, 15, 17…

Model Configuration Testing

cfg <- list(
  configuration="irish_2week",
  image_list = list(tox_levels = c(0,10,30,80),
                    forecast_steps = 1,
                    n_steps = 2,
                    minimum_gap = 4,
                    maximum_gap = 10,
                    multisample_weeks = "last",
                    toxins = c("gtx1", "gtx4", "gtx2", "gtx3", "dcgtx2", "dcgtx3", "gtx5", "neo", "dcstx", "stx", "c1", "c2")),
  model = list(balance_val_set=FALSE,
               downsample=FALSE,
               use_class_weights=FALSE,
               dropout1 = 0.3,
               dropout2 = 0.3,
               batch_size = 32, 
               units1 = 32, 
               units2 = 32, 
               epochs = 64, 
               validation_split = 0.2,
               shuffle = TRUE,
               num_classes = 4,
               optimizer="adam",
               loss_function="categorical_crossentropy",
               model_metrics=c("categorical_accuracy")),
  train_test = list(split_by = "year_region_species",
                    train = list(species = "mytilus",
                                 region = "maine",
                                 year=c("2014", "2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022", "2023")), 
                    test = list(species = "mytilus",
                                region = "maine",
                                year = c("2024")))
)

Use run_ensemble_test() to test the performance of a model configuration

The output includes the metrics of predictions from an ensemble of n_runs,

test$ensemble_metrics
## # A tibble: 1 × 10
##      tp    fp    tn    fn accuracy cl_accuracy   f_1 precision sensitivity
##   <int> <int> <int> <int>    <dbl>       <dbl> <dbl>     <dbl>       <dbl>
## 1     5     8   425     4    0.717       0.973 0.455     0.385       0.556
## # ℹ 1 more variable: specificity <dbl>

and an ensemble confusion matrix

test$ensemble_cm

Forecasting

To make real-time predictions, assign split_by = "forecast_mode" in the configuration object, then run:

forecast <- run_ensemble_forecast(cfg, input_data, ...)

Requirements from CRAN