Skip to content

NCI-CISNET/shg-r

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

493 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Smoking History Generator: R Interface

R-CMD-check License: GPL-3

About

This R package provides a convenient interface to the CISNET Smoking History Generator. It can produce the identical outputs as the command-line version (CLI) of the Smoking History Generator in R and offers an easy way for modelers to access the Smoking History Generator directly in R.

Getting Started

Installation from CRAN

install.packages("SmokingHistoryGenerator")

Installation from GitHub

install.packages("pak")
pak::pak("NCI-CISNET/shg-r")
# OR
pak::pak("NCI-CISNET/shg-r@[optional-branch-of-your-choice]")

Precompiled binary from GitHub Releases (optional)

Releases ship per-OS binaries from R CMD INSTALL --build. Download the asset in your browser from Releases (no GitHub token), then install from the saved file.

macOS (Apple Silicon) — use the exact downloaded filename (including (1) if the browser added it). R 4.6+ no longer accepts type = "binary" for macOS CRAN builds; pass this session’s native binary type (or use R CMD INSTALL below):

pkg_tgz <- path.expand("~/Downloads/SmokingHistoryGenerator_6.5.2-1.0.0_macos-arm64.tgz")
stopifnot(file.exists(pkg_tgz))
install.packages(pkg_tgz, repos = NULL, type = .Platform$pkgType)

On older R, .Platform$pkgType is still the right choice when it is not "source". Shell install avoids the type argument entirely:

R CMD INSTALL /path/to/SmokingHistoryGenerator_6.5.2-1.0.0_macos-arm64.tgz

Intel Macs use _macos-x64.tgz. Windows and Linux assets use .zip / *_linux-*_R_*.tar.gz with the same install.packages(..., repos = NULL, type = .Platform$pkgType) idea when your R build reports a non-source pkg type.

Loading parameter sets

The SHG needs calibrated input files (initiation, cessation, CPD, and mortality tables). The package ships a default CRAN-sized NHIS-1965–2018 csv-partial under inst/extdata/2018/ (smoking/, mortality/). Full NHIS-style tables are distributed as parameter bundles via Zenodo (and GitHub Releases). See ?shg_load_params for bundle URLs, ACM vs OCM mortality, authentication, and cache behavior.

Traditional workflow: local folder with already-uncompressed files

Use this when you already have a local directory containing smoking/ and mortality/ files and want to point SHG directly at those inputs.

library(SmokingHistoryGenerator)
shg <- new(SHGInterface)

shg$input_data_folder   <- "/path/to/usa-national@smok-2018-mort-2016"
shg$initiation_filename <- "smoking/initiation.csv"
shg$cessation_filename  <- "smoking/cessation.csv"
shg$cpd_filename        <- "smoking/cpd.csv"
shg$mortality_filename  <- "mortality/acm.csv"  # or mortality/ocm-excl-lung-cancer.csv

run_cfg <- list(
  individuals = 1e5,
  race = 0,
  sex = 0,
  cohort_year = 1980
)
bundle <- shg$runSim(run_cfg)
sim <- bundle$results

Recommended workflow: config list + bundle zip path

Use a single config list that includes both bundle provenance and run fields. For now this example uses a local zip path; later this can point to a Zenodo URL.

library(SmokingHistoryGenerator)
shg <- new(SHGInterface)

# Local zip path for now (replace with Zenodo URL when published).
# Git checkout: tests/testdata/usa-national@smok-2018-mort-2016.zip
zip_path <- "/path/to/usa-national@smok-2018-mort-2016.zip"

run_cfg <- list(
  params_bundle_source = zip_path,
  params_mortality = "acm",   # or "ocm"; alias `mortality = "ocm"` also works
  individuals = 1e5,
  race = 0,
  sex = 0,
  cohort_year = 1980
)

# Hydrate tables from bundle metadata in config
shg_apply_config(shg, run_cfg)

# Single run call returns coupled outputs
bundle <- shg$runSim(run_cfg)
sim <- bundle$results

Future Zenodo variant (same pattern; replace xxxx with the published record id):

run_cfg <- list(
  params_bundle_source = "https://zenodo.org/records/xxxx/files/usa-national@smok-2018-mort-2016.zip",
  params_mortality = "acm",
  individuals = 1e5,
  race = 0,
  sex = 0,
  cohort_year = 1980
)

The bundle is downloaded/extracted once and cached locally; subsequent calls reuse the cache.


Basic usage

Using a config list that includes a parameter bundle source (recommended), you can launch a smoking history simulation as follows:

library(SmokingHistoryGenerator)
shg <- new(SHGInterface)

# Local zip path for now (replace with Zenodo URL when published)
zip_path <- "/path/to/usa-national@smok-2018-mort-2016.zip"

N <- 10^5 # Individuals to simulate (REPEAT)
race = 0 # All races combined
sex = 0 # male
cohort_year = 1940
run_cfg <- list(
  params_bundle_source = zip_path,
  params_mortality = "acm",
  individuals = N,
  race = race,
  sex = sex,
  cohort_year = cohort_year
)

# Hydrate parameter tables from config bundle metadata
shg_apply_config(shg, run_cfg)

bundle <- shg$runSim(run_cfg)
RNGSTREAM_SIM <- bundle$results

For a single object that couples simulated rows with original_config, repro_config (full snapshot), and run_info (machine/software audit), call the 6-argument method with attach_run_info = TRUE:

bundle <- shg$runSim(run_cfg)

sim <- bundle$results
cfg_intent <- bundle$original_config
cfg_repro <- bundle$repro_config
audit <- bundle$run_info

Common use cases

1) Apply a sparse config safely (defaults-first)

shg <- new(SHGInterface)
shg_apply_config(shg, list(cohort_year = 1950))

shg_apply_config() resets the instance to factory defaults first, then applies only the keys you supply.

2) Save sparse intent or full repro config with one writer

# Small hand-editable config snippet
shg_write_config_yaml(bundle$original_config, "intent.yml")

# Full replay config
shg_write_config_yaml(bundle$repro_config, "repro.yml")

The same shg_write_config_yaml(config, path) function handles both.

3) Re-run from a full repro config in-memory (no file)

shg2 <- new(SHGInterface)
shg_apply_config(shg2, bundle$repro_config)
sim2 <- shg2$runSim(bundle$repro_config)
sim2_df <- sim2$results

4) Load once, then change cohort for another run

shg3 <- new(SHGInterface)
base_run <- shg_load_config(shg3, "repro.yml")  # applies params + engine settings

# Keep everything else the same, change only cohort year
base_run$cohort_year <- 2000

sim3 <- shg3$runSim(base_run)
sim3_df <- sim3$results

You can also use a pre-generated population instead of using fixed values for race, sex, cohort_year:

If birth_cohort spans many distinct years (as in this illustration), you need full NHIS-style inputs—initiation, cessation, CPD, and mortality tables that include every cohort column your population uses. The trimmed CSVs under inst/extdata/2018 do not cover that; they only bundle a few cohorts for CRAN. Use shg_load_params() or set input_data_folder to a directory with complete tables.

shg <- new(SHGInterface)
# Full tables required for multi-year cohorts—not system.file("extdata", "2018", ...):
shg$input_data_folder <- "/path/to/NHIS-1965-2018/csv-complete"
N <- 10^5 # Individuals to simulate (REPEAT)
pop <- list(
    race = rep(0, N),
    sex = sample(x = c(0, 1), size = N, prob = c(0.5, 0.5), replace = TRUE),
    birth_cohort = rep(1930:1949, N / 20)
)

# The following are default configuration values; change as needed
shg$rng_strategy <- "RngStream"
shg$number_of_segments <- -1  # -1 = auto, or set explicit value for reproducibility
shg$num_threads <- -1         # -1 = auto (all cores), 1 = single-threaded

RNGSTREAM_SIM_POP <- shg$runSimFromDataFrame(pop)

Note on RNG strategies:

  • RngStream (default): Recommended for all use cases, especially multi-segment and parallel simulations. Supports multiple segments and multi-threading while maintaining IID properties.
  • MersenneTwister: Legacy RNG for backward compatibility. Restricted to single-segment, single-threaded execution due to limitations in maintaining IID properties across segments. Attempting to use MersenneTwister with number_of_segments > 1 or num_threads != 1 will result in an error.

If you want to produce identical results as with legacy versions of the SHG command line version (v6.3.5 and earlier), you must select the Mersenne Twister strategy:

library(SmokingHistoryGenerator)
shg <- new(SHGInterface)
N <- 10^5 # Individuals to simulate (REPEAT)
# If you want to produce identical results as previous versions of the legacy CLI you must set the following properties:
shg$rng_strategy <- "MersenneTwister"
# Note: MersenneTwister is automatically restricted to 1 segment and non-parallel execution
MT_SIM <- shg$runSimFromFixedValues(N, 0, 0, 1940)

CPD Output Format

The cpd_format property controls how cigarettes-per-day data is returned:

shg$cpd_format <- "sparse"  # Default - fastest with CPD: "20, 20, 10, 3"
shg$cpd_format <- "none"    # Fastest - no CPD column returned
shg$cpd_format <- "legacy"  # Backwards compatible: "17 (20), 18 (20), 19 (10)"

Note: The sparse format stores only CPD values. The age can be computed as init_age + index since values are sequential from initiation age.

File Output Mode

For CLI-like performance, you can write rows directly to disk. With the bundled return form, the in-memory object still includes configs and audit metadata, but not the full simulated row set (to conserve memory):

library(SmokingHistoryGenerator)
shg <- new(SHGInterface)

run_cfg <- list(
  params_bundle_source = "/path/to/usa-national@smok-2018-mort-2016.zip",
  params_mortality = "acm",
  cohort_year = 1950,
  output_file = "/path/to/output-fixed.csv"
)

# Load parameters from config metadata, then run
bundle <- shg$runSim(run_cfg)

# Same bundle structure; output rows are in output-fixed.csv
# Defaults used here: individuals = 1000, race = 0, sex = 0.
# bundle$original_config / bundle$repro_config / bundle$run_info are returned

File output matches CLI's data format (semicolon-separated).

Random number generator seeds

Set seeds on the SHGInterface before running (for example shg$seed_init, shg$seed_cess, shg$seed_mortality, shg$seed_misc). Use getReproConfig() after a run to inspect the effective values used. See ?SHGInterface and ?getReproConfig.

Configuration management

Use shg_apply_config() for intent-oriented updates, getConfig() / useConfig() to read or replace settings, and shg_write_config_yaml() / shg_load_config() to save or reload portable YAML for exact reruns.

Contributors

The Smoking History Generator CLI (Command Line Interface) was developed in the early 2000s and maintained by several contributors since that time.

  • Original author: Martin Krapcho
  • Contributors: Ben Racine, Alexander Gaenko, John Clarke
  • R package wrapper author: John Clarke
  • Maintainer: John Clarke
  • NCI contact: Rocky Feuer

Publications

You can find a complete set of publications about the Smoking History Generator via CISNET and project-specific resource pages linked from there.

Funding

Funding for the CISNET Smoking History Generator and the Rcpp wrapper came from the following National Cancer Institute (NCI) grants.

  • U01CA253858
  • U01CA199284
  • U01CA152956
  • U01CA097415

License

You may not use the Software or Datasets for commercial purposes without prior written consent from the CISNET Lung Working Group and without entering into a separate license agreement regarding such commercial use. Contact: Rafael Meza Rodriguez rmeza@bccrc.ca and Jamie Tam jamie.tam@yale.edu.

The software is released under the GPL-3. The test input tables shipped with the package are released under the CC BY-SA 4.0 license.

Copyright Notice

© 2026 CISNET Lung Working Group. All rights reserved.

About

An R wrapper for the CISNET Smoking History Generator

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors