MohammedAliSharafuddin/small-sample-survey-framework

Small-Sample Survey Research: Simulation-Validated Decision Framework

Project Overview

This repository contains a reproducible R-based research workflow for evaluating statistical methods in small-sample survey research. The central objective is to develop a simulation-validated decision framework for studies with sample sizes below 30, where conventional large-sample assumptions are often implausible and method choice can materially alter substantive conclusions.

The current implementation focuses on two analytical settings that recur in small-sample survey work:

  • two-group comparisons
  • association analysis

A regression block is planned as a later phase of the project and is not part of the current manuscript-focused release.

Research Problem: Misuse of Large-Sample Methods in Small Samples

Small-sample survey studies are common in specialised populations, pilot evaluations, classroom studies, organisational diagnostics, and early-stage intervention research. In these settings, analysts frequently rely on procedures that were derived or justified under asymptotic conditions. When those conditions are not met, nominal error rates, interval coverage, and inferential stability may degrade in ways that are not transparent from a single applied analysis.

The practical problem is not only low power. It is also method selection under uncertainty: different combinations of outcome scale, distributional shape, and noise structure may favour different inferential strategies.

Why This Matters in Survey Research

Survey researchers regularly analyse:

  • short scales with limited response categories
  • skewed attitudinal or behavioural outcomes
  • modest subgroup comparisons
  • correlation-based evidence in exploratory instrument work

Under these conditions, the choice among parametric, nonparametric, bootstrap, and Bayesian approaches is rarely neutral. A method that performs adequately for approximately normal interval responses may behave differently for skewed Likert-type data with the same nominal sample size. This repository treats that choice as an empirical design problem rather than a matter of convention.

Methodological Contribution

The project contributes a structured simulation workflow that compares candidate methods across a controlled factorial design:

  • sample sizes: 10, 20, 30
  • data-generating distributions: normal and skewed
  • effects: null and moderate
  • measurement scales: interval and Likert
  • noise conditions: low and high
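The factorial design above can be enumerated in a few lines of base R. This is a minimal sketch rather than the repository's own code: the column names and the `scenario_id` index are illustrative assumptions, and the source does not say whether the listed sample sizes are per group or total.

```r
# Enumerate every combination of the five design factors listed above.
# Column names are hypothetical; only the factor levels come from the README.
design <- expand.grid(
  n            = c(10, 20, 30),
  distribution = c("normal", "skewed"),
  effect       = c("null", "moderate"),
  scale        = c("interval", "likert"),
  noise        = c("low", "high"),
  stringsAsFactors = FALSE
)

# Attach a simple running index so each scenario can be referenced in outputs.
design$scenario_id <- seq_len(nrow(design))

nrow(design)  # 3 * 2 * 2 * 2 * 2 = 48 scenarios
head(design)
```

Half of the 48 scenarios are null-effect cells, which feed Type I error estimates, and half are moderate-effect cells, which feed power estimates.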

Primary performance criteria:

  • Type I error
  • power
  • bias
  • confidence interval coverage
  • Monte Carlo standard errors for the estimated performance criteria
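As an illustration of how a rejection-based criterion and its Monte Carlo standard error might be computed, the sketch below treats each replication's p-value as a Bernoulli rejection indicator. The function name `rejection_rate` and the 0.05 significance level are assumptions made for this example; the repository may compute these quantities differently.

```r
# Estimate a rejection rate and its Monte Carlo standard error (MCSE).
# Under a null scenario the rate estimates Type I error;
# under a moderate-effect scenario it estimates power.
rejection_rate <- function(p_values, alpha = 0.05) {
  stopifnot(is.numeric(p_values), length(p_values) > 0)
  n_reps <- length(p_values)
  rate   <- mean(p_values < alpha)
  # Binomial standard error of a proportion estimated from n_reps replications.
  mcse   <- sqrt(rate * (1 - rate) / n_reps)
  c(rate = rate, mcse = mcse, n_reps = n_reps)
}

# Illustrative input: uniform p-values, as expected from a calibrated test under the null.
set.seed(1)
null_p <- runif(2000)
rejection_rate(null_p)

# MCSE for a true rate of 0.05 at two replication counts:
sqrt(0.05 * 0.95 / 2000)  # about 0.0049
sqrt(0.05 * 0.95 / 20)    # about 0.0487
```

The last two lines show why replication count matters: with only 20 replications the standard error is about as large as the nominal rate itself.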

Descriptive summaries reported alongside the primary criteria:

  • mean estimates
  • mean p-values
  • mean Bayes factors

The current analytical blocks are:

  1. Block A: Welch t-test, Mann-Whitney U, Bayesian t-test, bootstrap confidence interval
  2. Block B: Pearson correlation, Spearman correlation, Bayesian correlation, bootstrap confidence interval
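To make Block A concrete, the following sketch applies three of its four methods to a single simulated dataset using base R only. It is not the repository's implementation: the function name `block_a_once` is hypothetical, the percentile bootstrap with independent within-group resampling is an assumed variant (the source does not state which bootstrap interval is used), and the Bayesian t-test is left out so the example runs without the BayesFactor package.

```r
# One replication of a two-group comparison with base R tools.
block_a_once <- function(x, y, boot_reps = 199, conf_level = 0.95) {
  # Welch t-test: t.test() uses var.equal = FALSE by default.
  welch <- t.test(x, y, conf.level = conf_level)

  # Mann-Whitney U test (called the Wilcoxon rank-sum test in R).
  mwu <- wilcox.test(x, y)

  # Percentile bootstrap CI for the mean difference,
  # resampling each group independently with replacement.
  boot_diff <- replicate(
    boot_reps,
    mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE))
  )
  tail_prob <- (1 - conf_level) / 2
  boot_ci <- unname(quantile(boot_diff, c(tail_prob, 1 - tail_prob)))

  list(
    estimate = mean(x) - mean(y),
    welch_p  = welch$p.value,
    welch_ci = as.numeric(welch$conf.int),
    mwu_p    = mwu$p.value,
    boot_ci  = boot_ci
  )
}

# Illustrative data: 10 observations per group, true shift of 0.5 SD.
set.seed(2024)
x <- rnorm(10, mean = 0.5)
y <- rnorm(10)
res <- block_a_once(x, y)
str(res)
```

Repeating this function over many simulated datasets per scenario yields the p-values and intervals from which Type I error, power, and coverage are summarised.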

Design parameterisation follows the current manuscript protocol:

  • Block A holds the standardised group difference constant at Cohen's d = 0.50 under the moderate-effect condition, with the raw mean shift scaled by the scenario-specific outcome standard deviation.
  • Block B holds the latent correlation at 0.35 under the moderate-effect condition; observed-scale truth values are re-estimated after measurement noise is added, so attenuation is treated as part of the data-generating process rather than ignored.
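The two parameterisation rules can be sketched as follows. Everything beyond d = 0.50 and the latent correlation of 0.35 is an assumption for illustration: the function names, the additive independent Gaussian noise model, the noise standard deviation of 0.5, and the outcome standard deviation of 1.2 are hypothetical and are not taken from the repository.

```r
# Block A: convert a standardised effect into a raw mean shift.
raw_shift <- function(d, outcome_sd) d * outcome_sd

# Block B: approximate the observed-scale correlation by simulation,
# drawing a large sample, adding measurement noise, and re-estimating.
observed_correlation <- function(rho, noise_sd, truth_draws = 10000, seed = 1) {
  set.seed(seed)
  z1 <- rnorm(truth_draws)
  z2 <- rnorm(truth_draws)
  latent_x <- z1
  latent_y <- rho * z1 + sqrt(1 - rho^2) * z2  # latent correlation equals rho
  # Assumed noise model: independent Gaussian error added to both variables.
  obs_x <- latent_x + rnorm(truth_draws, sd = noise_sd)
  obs_y <- latent_y + rnorm(truth_draws, sd = noise_sd)
  cor(obs_x, obs_y)
}

raw_shift(0.50, outcome_sd = 1.2)            # 0.6 raw units
observed_correlation(0.35, noise_sd = 0.5)   # close to 0.28, below the latent 0.35
```

Under this assumed noise model the attenuated value has a closed form, rho / (1 + noise_sd^2) = 0.35 / 1.25 = 0.28, which the simulated estimate should approximate. Re-estimating by simulation becomes useful when no closed form is available, for example after responses are discretised to a Likert scale.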

Repository Structure

small-sample-survey-framework/
├─ README.md
├─ LICENSE
├─ .gitignore
├─ renv.lock
├─ small-sample-survey-framework.Rproj
├─ data/
├─ output/
│  ├─ tables/
│  ├─ figures/
│  ├─ logs/
│  └─ derived/
├─ scripts/
├─ functions/
├─ quarto/
├─ manuscript/
└─ docs/

How to Run the Project

  1. Open small-sample-survey-framework.Rproj in RStudio, or start a terminal session in the project root.
  2. Run Rscript scripts/99_run_all.R.
  3. Review generated outputs in output/tables, output/figures, and output/derived.

The codebase supports two explicit execution modes:

  • development mode is the default and is intended for fast verification of the pipeline. It currently uses 20 replications, 199 bootstrap resamples, and 2000 truth draws. Outputs committed from this mode are starter artefacts, not manuscript-scale results.
  • manuscript mode is intended for substantive reporting. It defaults to 2000 replications, 1999 bootstrap resamples, and 10000 truth draws. In manuscript mode, the setup script enforces a minimum of 999 bootstrap resamples.

Environment variables that control a run:

  • SMALL_SAMPLE_RUN_MODE
  • SMALL_SAMPLE_N_REPS
  • SMALL_SAMPLE_BOOT_REPS
  • SMALL_SAMPLE_TRUTH_DRAWS
  • SMALL_SAMPLE_SEED
  • SMALL_SAMPLE_BF_THRESHOLD
  • SMALL_SAMPLE_BF_SENSITIVITY_THRESHOLD
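A configuration reader along these lines could resolve the run mode and its overrides. This is a hedged sketch, not the repository's setup script: the function `read_run_config`, the accepted mode strings "development" and "manuscript", and the choice to stop with an error when the bootstrap minimum is violated are all assumptions. Only the variable names and the per-mode defaults come from this README, and the seed and Bayes-factor threshold variables are omitted because their defaults are not stated here.

```r
# Resolve run settings from environment variables, falling back to mode defaults.
read_run_config <- function() {
  mode <- Sys.getenv("SMALL_SAMPLE_RUN_MODE", unset = "development")

  # Per-mode defaults as documented in this README.
  defaults <- switch(
    mode,
    development = list(n_reps = 20L,   boot_reps = 199L,  truth_draws = 2000L),
    manuscript  = list(n_reps = 2000L, boot_reps = 1999L, truth_draws = 10000L),
    stop("Unknown run mode: ", mode)
  )

  # Read an integer override, or return the default when the variable is unset or empty.
  env_int <- function(name, default) {
    raw <- Sys.getenv(name, unset = "")
    if (nzchar(raw)) as.integer(raw) else default
  }

  cfg <- list(
    mode        = mode,
    n_reps      = env_int("SMALL_SAMPLE_N_REPS", defaults$n_reps),
    boot_reps   = env_int("SMALL_SAMPLE_BOOT_REPS", defaults$boot_reps),
    truth_draws = env_int("SMALL_SAMPLE_TRUTH_DRAWS", defaults$truth_draws)
  )

  # Assumed enforcement style: fail loudly rather than silently raising the value.
  if (mode == "manuscript" && cfg$boot_reps < 999L) {
    stop("Manuscript mode requires at least 999 bootstrap resamples.")
  }
  cfg
}

# Example: request manuscript mode for this R session, inspect, then clean up.
Sys.setenv(SMALL_SAMPLE_RUN_MODE = "manuscript")
str(read_run_config())
Sys.unsetenv("SMALL_SAMPLE_RUN_MODE")
```

Reading every setting in one place keeps the two modes comparable and makes it easy to record the resolved configuration in the execution log.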

Dependency snapshots are intentionally kept separate from the main pipeline. If you change package versions deliberately, run Rscript scripts/98_snapshot_environment.R to update renv.lock.

Outputs

The workflow is designed to produce:

  • scenario-level simulation summaries as CSV tables
  • reproducible RDS objects containing raw and aggregated simulation results
  • truth tables that document the observed-scale estimands under each scenario
  • Bayes-factor threshold sensitivity tables for thresholds of 3 and 10
  • comparison figures in PNG and PDF formats
  • execution logs for transparent run tracking
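The threshold sensitivity tables listed above could be built from per-replication Bayes factors roughly as follows. The function name `bf_threshold_rates`, the strict inequality, and the reading of the Bayes factor as evidence for the alternative over the null are assumptions for illustration; the input vector is made-up example data, not simulation output.

```r
# Proportion of replications whose Bayes factor exceeds each evidence threshold.
# bf10 is assumed to express evidence for the alternative relative to the null.
bf_threshold_rates <- function(bf10, thresholds = c(3, 10)) {
  stopifnot(is.numeric(bf10), all(bf10 > 0))
  data.frame(
    threshold    = thresholds,
    support_rate = vapply(thresholds, function(k) mean(bf10 > k), numeric(1))
  )
}

# Illustrative Bayes factors from six hypothetical replications.
example_bf <- c(0.4, 1.2, 3.5, 8, 12, 30)
bf_threshold_rates(example_bf)
# threshold 3:  4 of 6 exceed it (about 0.667)
# threshold 10: 2 of 6 exceed it (about 0.333)
```

Reporting both thresholds shows how strongly a Bayesian decision rule's apparent error rate and power depend on the evidence cut-off chosen.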

Reproducibility Statement

This repository is structured as a reproducible research project and was developed against R 4.5.3. All simulations, summaries, and figures are generated from source scripts. Project dependencies are managed with renv, paths are project-relative, and random seeds are set explicitly for the simulation and bootstrap components.

The current workflow requests Bayes factors only and does not request posterior samples from BayesFactor. In the package version recorded by renv.lock, ttestBF() computes Bayes factors by Gaussian quadrature and correlationBF() computes Bayes factors through deterministic numerical routines when posterior = FALSE. Reproducibility therefore depends on fixed package versions and numerical libraries rather than on MCMC output.

Planned Journal Submission

The repository is being developed as the computational companion to a methods paper on statistical decision-making for small-sample survey studies. The intended manuscript will report the simulation design, performance criteria, Monte Carlo precision, practical decision rules, and an applied validation phase using empirical survey data.

Citation Placeholder

Formal citation metadata will be added upon preprint release or journal submission. Until then, please cite the repository by title and URL.

About

A simulation-based framework for statistical inference in small-sample survey research (n < 30), including parametric, nonparametric, bootstrap, and Bayesian methods.
