This repository contains a reproducible R-based research workflow for evaluating statistical methods in small-sample survey research. The central objective is to develop a simulation-validated decision framework for studies with sample sizes of 30 or fewer, where conventional large-sample assumptions are often implausible and method choice can materially alter substantive conclusions.
The current implementation focuses on two analytical settings that recur in small-sample survey work:
- two-group comparisons
- association analysis
A regression block is planned as a later phase of the project and is not part of the current manuscript-focused release.
Small-sample survey studies are common in specialised populations, pilot evaluations, classroom studies, organisational diagnostics, and early-stage intervention research. In these settings, analysts frequently rely on procedures that were derived or justified under asymptotic conditions. When those conditions are not met, nominal error rates, interval coverage, and inferential stability may degrade in ways that are not transparent from a single applied analysis.
The practical problem is not only low power. It is also method selection under uncertainty: different combinations of outcome scale, distributional shape, and noise structure may favour different inferential strategies.
Survey researchers regularly analyse:
- short scales with limited response categories
- skewed attitudinal or behavioural outcomes
- modest subgroup comparisons
- correlation-based evidence in exploratory instrument work
Under these conditions, the choice among parametric, nonparametric, bootstrap, and Bayesian approaches is rarely neutral. A method that performs adequately for approximately normal interval responses may behave differently for skewed Likert-type data with the same nominal sample size. This repository treats that choice as an empirical design problem rather than a matter of convention.
The project contributes a structured simulation workflow that compares candidate methods across a controlled factorial design:
- sample sizes: 10, 20, 30
- data-generating distributions: normal and skewed
- effects: null and moderate
- measurement scales: interval and Likert
- noise conditions: low and high
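The factorial design above can be enumerated as a scenario grid. The following is a minimal sketch in base R; the column names and the `scenario_id` column are illustrative assumptions, not necessarily the names used in the repository's scripts.

```r
# Full factorial grid of simulation scenarios.
# Column names are illustrative assumptions, not taken from the repository.
design <- expand.grid(
  n            = c(10, 20, 30),
  distribution = c("normal", "skewed"),
  effect       = c("null", "moderate"),
  scale        = c("interval", "likert"),
  noise        = c("low", "high"),
  stringsAsFactors = FALSE
)

# Attach a simple identifier so results can be joined back to scenarios.
design$scenario_id <- seq_len(nrow(design))

# 3 sample sizes x 2 distributions x 2 effects x 2 scales x 2 noise levels
nrow(design)  # 48
```

Crossing every factor level gives 48 scenarios for each analytical block.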
Primary performance criteria:
- Type I error
- power
- bias
- confidence interval coverage
- Monte Carlo standard errors for the estimated performance criteria
Descriptive summaries reported alongside the primary criteria:
- mean estimates
- mean p-values
- mean Bayes factors
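As a rough illustration of how the primary criteria and their Monte Carlo standard errors can be computed from one scenario's replications, the sketch below uses the standard binomial and mean-based formulas. The function name, argument names, and the 0.05 significance level are assumptions made for this example; the repository's own summary functions may differ.

```r
# Summarise one scenario's replications.
# Function and argument names are hypothetical; alpha = 0.05 is an assumption.
summarise_scenario <- function(p_values, estimates, ci_lower, ci_upper,
                               truth, alpha = 0.05) {
  n_reps <- length(p_values)

  # Rejection rate: Type I error under a null scenario, power otherwise.
  rejection_rate <- mean(p_values < alpha)

  # Coverage: share of intervals that contain the true value.
  coverage <- mean(ci_lower <= truth & truth <= ci_upper)

  data.frame(
    rejection_rate = rejection_rate,
    rejection_mcse = sqrt(rejection_rate * (1 - rejection_rate) / n_reps),
    bias           = mean(estimates) - truth,
    bias_mcse      = sd(estimates) / sqrt(n_reps),
    coverage       = coverage,
    coverage_mcse  = sqrt(coverage * (1 - coverage) / n_reps)
  )
}

# Toy input: four replications of a null scenario (truth = 0).
est <- c(0.1, -0.1, 0.2, -0.2)
toy <- summarise_scenario(
  p_values = c(0.01, 0.20, 0.50, 0.03),
  estimates = est,
  ci_lower = est - 1,
  ci_upper = est + 1,
  truth = 0
)
print(toy)
```

Because the Monte Carlo standard error of a proportion shrinks with the square root of the number of replications, 20 replications give only a coarse estimate of an error rate, while 2000 replications give a much tighter one.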
The current analytical blocks are:
- Block A: Welch t-test, Mann-Whitney U, Bayesian t-test, bootstrap confidence interval
- Block B: Pearson correlation, Spearman correlation, Bayesian correlation, bootstrap confidence interval
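To make the Block A comparison concrete, the sketch below applies the four methods to one simulated two-group data set. The data, seed, and variable names are invented for illustration, and the percentile bootstrap shown here is an assumption, since the bootstrap interval type is not specified above. The Bayes factor step runs only if the BayesFactor package is installed.

```r
set.seed(123)  # arbitrary seed, for this illustration only

# Hypothetical data: two groups of 10 with a half-SD mean difference.
g1 <- rnorm(10, mean = 0.0, sd = 1)
g2 <- rnorm(10, mean = 0.5, sd = 1)

# Welch t-test (unequal variances).
welch <- t.test(g2, g1, var.equal = FALSE)

# Mann-Whitney U test (called the Wilcoxon rank-sum test in R).
mwu <- wilcox.test(g2, g1)

# Percentile bootstrap interval for the mean difference (assumed type),
# resampling within each group; 199 resamples as in development mode.
boot_diff <- replicate(
  199,
  mean(sample(g2, replace = TRUE)) - mean(sample(g1, replace = TRUE))
)
boot_ci <- quantile(boot_diff, probs = c(0.025, 0.975), names = FALSE)

# Bayesian t-test, only if the BayesFactor package is available.
bf10 <- NA_real_
if (requireNamespace("BayesFactor", quietly = TRUE)) {
  bf <- BayesFactor::ttestBF(x = g2, y = g1)
  bf10 <- BayesFactor::extractBF(bf)$bf
}

c(welch_p = welch$p.value, mwu_p = mwu$p.value, bf10 = bf10)
boot_ci
```

Block B follows the same pattern, with `cor.test()` using `method = "pearson"` or `method = "spearman"` in place of the two-group tests.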
Design parameterisation follows the current manuscript protocol:
- Block A holds the standardized group difference constant at Cohen's d = 0.50 under the moderate-effect condition, with the raw mean shift scaled by the scenario-specific outcome standard deviation.
- Block B holds the latent correlation at 0.35 under the moderate-effect condition; observed-scale truth values are re-estimated after measurement noise is added, so attenuation is treated as part of the data-generating process rather than ignored.
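The two parameterisation rules can be sketched as follows. The outcome standard deviation, the noise standard deviation, and the additive Gaussian noise model are assumptions chosen for illustration; the repository's data-generating functions may model noise differently.

```r
# Block A: the raw mean shift is Cohen's d times the outcome SD.
d          <- 0.50
outcome_sd <- 1.8             # hypothetical scenario-specific SD
raw_shift  <- d * outcome_sd  # shift applied to the second group's mean

# Block B: latent correlation, then measurement noise on both variables.
rho_latent  <- 0.35
noise_sd    <- 0.5            # hypothetical noise level
truth_draws <- 10000          # manuscript-mode default stated in this README

set.seed(2024)                # arbitrary seed, for this illustration only
z1 <- rnorm(truth_draws)
z2 <- rnorm(truth_draws)
x_latent <- z1
y_latent <- rho_latent * z1 + sqrt(1 - rho_latent^2) * z2

# Assumed noise model: independent additive Gaussian noise on each variable.
x_obs <- x_latent + rnorm(truth_draws, sd = noise_sd)
y_obs <- y_latent + rnorm(truth_draws, sd = noise_sd)

# Observed-scale truth, re-estimated from the noisy draws.
rho_observed_mc <- cor(x_obs, y_obs)

# Under this noise model the attenuated correlation has a closed form:
# unit latent variances inflate to 1 + noise_sd^2, the covariance is unchanged.
rho_observed_theory <- rho_latent / (1 + noise_sd^2)

c(monte_carlo = rho_observed_mc, theory = rho_observed_theory)
```

With these illustrative values the closed form gives 0.35 / 1.25 = 0.28, so coverage and bias for Block B would be judged against an observed-scale target near 0.28 rather than the latent 0.35.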
small-sample-survey-framework/
├─ README.md
├─ LICENSE
├─ .gitignore
├─ renv.lock
├─ small-sample-survey-framework.Rproj
├─ data/
├─ output/
│ ├─ tables/
│ ├─ figures/
│ ├─ logs/
│ └─ derived/
├─ scripts/
├─ functions/
├─ quarto/
├─ manuscript/
└─ docs/
- Open `small-sample-survey-framework.Rproj` in RStudio or use the project root in a terminal session.
- Run `Rscript scripts/99_run_all.R`.
- Review generated outputs in `output/tables`, `output/figures`, and `output/derived`.
The codebase supports two explicit execution modes:
- `development` mode is the default and is intended for fast verification of the pipeline. It currently uses 20 replications, 199 bootstrap resamples, and 2000 truth draws. The committed outputs from this mode are starter artefacts, not manuscript-scale results.
- `manuscript` mode is intended for substantive reporting. It defaults to 2000 replications, 1999 bootstrap resamples, and 10000 truth draws. In manuscript mode, the setup script enforces a minimum of 999 bootstrap resamples.
Environment variables that control a run:
- `SMALL_SAMPLE_RUN_MODE`
- `SMALL_SAMPLE_N_REPS`
- `SMALL_SAMPLE_BOOT_REPS`
- `SMALL_SAMPLE_TRUTH_DRAWS`
- `SMALL_SAMPLE_SEED`
- `SMALL_SAMPLE_BF_THRESHOLD`
- `SMALL_SAMPLE_BF_SENSITIVITY_THRESHOLD`
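One plausible way a setup script could resolve these environment variables is sketched below. The function `read_run_config()` and its helper are hypothetical, and stopping with an error when the bootstrap minimum is violated is an assumption (the actual script might clamp the value instead). The seed and Bayes-factor threshold variables are left out because their defaults are not stated here.

```r
# Hypothetical helper: resolve run settings from environment variables.
read_run_config <- function() {
  mode <- Sys.getenv("SMALL_SAMPLE_RUN_MODE", unset = "development")
  if (!nzchar(mode)) mode <- "development"
  mode <- match.arg(mode, c("development", "manuscript"))

  # Mode-specific defaults as documented in this README.
  defaults <- switch(
    mode,
    development = list(n_reps = 20L,   boot_reps = 199L,  truth_draws = 2000L),
    manuscript  = list(n_reps = 2000L, boot_reps = 1999L, truth_draws = 10000L)
  )

  # Read an integer override, falling back to the default when unset or empty.
  env_int <- function(name, default) {
    value <- Sys.getenv(name, unset = "")
    if (nzchar(value)) as.integer(value) else default
  }

  config <- list(
    mode        = mode,
    n_reps      = env_int("SMALL_SAMPLE_N_REPS", defaults$n_reps),
    boot_reps   = env_int("SMALL_SAMPLE_BOOT_REPS", defaults$boot_reps),
    truth_draws = env_int("SMALL_SAMPLE_TRUTH_DRAWS", defaults$truth_draws)
  )

  # Assumed enforcement style: fail loudly rather than silently adjust.
  if (mode == "manuscript" && config$boot_reps < 999L) {
    stop("manuscript mode requires at least 999 bootstrap resamples")
  }
  config
}

# Example: switch to manuscript mode for the current R session.
Sys.setenv(SMALL_SAMPLE_RUN_MODE = "manuscript")
str(read_run_config())
```

Setting the variables in the shell before calling `Rscript scripts/99_run_all.R` would have the same effect as `Sys.setenv()` inside the session.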
Dependency snapshots are intentionally separate from the main pipeline. After a deliberate change to package versions, run `Rscript scripts/98_snapshot_environment.R` to update `renv.lock`.
The workflow is designed to produce:
- scenario-level simulation summaries as CSV tables
- reproducible RDS objects containing raw and aggregated simulation results
- truth tables that document the observed-scale estimands under each scenario
- Bayes-factor threshold sensitivity tables for thresholds of 3 and 10
- comparison figures in PNG and PDF formats
- execution logs for transparent run tracking
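The Bayes-factor threshold sensitivity tables listed above can be pictured with a small sketch. The function name, the output column names, and the symmetric treatment of evidence for the null (BF10 at or below 1/threshold) are assumptions for illustration; the Bayes factors in the example are made up.

```r
# Hypothetical helper: share of replications whose Bayes factor (BF10)
# crosses each evidence threshold, in either direction.
bf_sensitivity <- function(bf10, thresholds = c(3, 10)) {
  data.frame(
    threshold       = thresholds,
    prop_support_h1 = vapply(thresholds, function(k) mean(bf10 >= k), numeric(1)),
    prop_support_h0 = vapply(thresholds, function(k) mean(bf10 <= 1 / k), numeric(1))
  )
}

# Made-up Bayes factors from seven replications.
example_bf <- c(0.05, 0.2, 0.8, 2.5, 4, 12, 30)
sens <- bf_sensitivity(example_bf)
print(sens)
```

Reporting both thresholds shows how much a Bayesian decision rule depends on the cut-off: a stricter threshold of 10 classifies fewer replications as decisive than a threshold of 3.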
This repository is structured as a reproducible research project and was developed against R 4.5.3. All simulations, summaries, and figures are generated from source scripts. Project dependencies are managed with renv, paths are project-relative, and random seeds are set explicitly for the simulation and bootstrap components.
The current workflow requests Bayes factors only and does not request posterior samples from BayesFactor. In the package version recorded by `renv.lock`, `ttestBF()` computes Bayes factors by Gaussian quadrature and `correlationBF()` computes Bayes factors through deterministic numerical routines when `posterior = FALSE`. Reproducibility therefore depends on fixed package versions and numerical libraries rather than on MCMC output.
The repository is being developed as the computational companion to a methods paper on statistical decision-making for small-sample survey studies. The intended manuscript will report the simulation design, performance criteria, Monte Carlo precision, practical decision rules, and an applied validation phase using empirical survey data.
Formal citation metadata will be added upon preprint release or journal submission. Until then, please cite the repository by title and URL.