bigMICE

bigMICE is an R package built on the sparklyr library. It performs multiple imputation on large datasets in an efficient and scalable way.

Setup and recommendations

Spark and sparklyr

Optionally, run the following commands to set up an isolated environment for a new project:

install.packages("renv")
library(renv)
renv::init()

Install sparklyr and Spark (run once) with the following commands. If you are not using the latest version of sparklyr, make sure to install a compatible Spark version, and vice versa. For the latest sparklyr release (1.9.1), the compatible Spark version is 4.0.0. For sparklyr versions < 1.9.0, you will need a Spark version < 4.0.0.

install.packages("sparklyr") # version 1.9.1
options(timeout = 6000)      # raise R's download timeout (in seconds) for the large Spark download
library(sparklyr)
spark_install(version = "4.0.0")

To check that the correct combination of Spark and sparklyr has been installed, use the following two commands:

sparklyr::spark_installed_versions()
utils::packageVersion("sparklyr")
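
The pairing rule described above can also be written down as a small check. This is only a sketch: spark_pairing_ok is a hypothetical helper, not part of sparklyr or bigMICE, and it assumes that sparklyr 1.9.0 and later go with Spark 4.0.0 or newer, which is an interpretation of the rule rather than something sparklyr enforces.

```r
# Hypothetical helper (not part of sparklyr or bigMICE). It encodes only the
# pairing rule stated in this README.
spark_pairing_ok <- function(sparklyr_version, spark_version) {
  sparklyr_version <- package_version(sparklyr_version)
  spark_version <- package_version(spark_version)
  if (sparklyr_version < "1.9.0") {
    # Older sparklyr releases need a Spark version below 4.0.0
    spark_version < "4.0.0"
  } else {
    # Assumption: sparklyr 1.9.0 and later are paired with Spark 4.0.0 or newer
    spark_version >= "4.0.0"
  }
}

spark_pairing_ok("1.9.1", "4.0.0")  # TRUE
spark_pairing_ok("1.8.6", "4.0.0")  # FALSE
```

The first argument can be replaced by the value returned by utils::packageVersion("sparklyr"), and the second by a Spark version reported by sparklyr::spark_installed_versions().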

Hadoop

For robust execution of Spark on big datasets, checkpointing may be needed. Checkpointing requires Hadoop to be installed. For smaller datasets or toy examples, the Hadoop installation can be skipped.

On Linux: https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

On Windows: https://gist.github.com/vorpal56/5e2b67b6be3a827b85ac82a63a5b3b2e

Note that Spark requires specific Java versions (JDK 17 or JDK 21 at the time of writing); see https://spark.apache.org/docs/latest/

Installation

To install bigMICE from GitHub, use the following commands in R:

# Install devtools if not already installed
install.packages("devtools")

# Install bigMICE from GitHub
devtools::install_github("bigcausallab/bigMICE")

Once installed, load the package:

library(bigMICE)

Example Usage

Loading necessary libraries:

library(bigMICE)
library(dplyr)
library(sparklyr)

Creating a local Spark session:

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "10G"  # memory for the Spark driver
conf$spark.memory.fraction <- 0.8             # share of the heap used for execution and storage
conf$`sparklyr.cores.local` <- 4              # number of local cores
# conf$`spark.local.dir` <- "/local/data/spark_tmp/" # needed for checkpointing.
# If not possible, add the parameter checkpointing = FALSE to the mice.spark call

sc <- spark_connect(master = "local", config = conf)
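
To get a feel for what the memory settings above mean, the following worked calculation may help. It relies on Spark's documented definition of spark.memory.fraction as a fraction of the JVM heap minus 300 MiB of reserved memory, and it assumes the heap equals the requested 10G exactly, so the result is only an approximation of what Spark reports at runtime.

```r
# Rough estimate of the memory Spark shares between execution and storage
# for the configuration above (driver-memory = "10G", memory.fraction = 0.8).
driver_memory_mib <- 10 * 1024  # 10G expressed in MiB (assumes heap == requested size)
reserved_mib <- 300             # memory Spark reserves before applying the fraction
memory_fraction <- 0.8          # value of spark.memory.fraction set above

unified_memory_mib <- (driver_memory_mib - reserved_mib) * memory_fraction
unified_memory_mib  # 7952 MiB, i.e. a little under 8 GiB
```

Lowering driver-memory or spark.memory.fraction shrinks this figure proportionally, which is worth keeping in mind when adapting the example to a machine with less RAM.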

Download the dataset boys.rda from the mice R package and save it to the current R working directory. Then run the following commands.

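# Loading the data: load() creates the data frame `boys` in the workspace
load("boys.rda")
# Writing the data to CSV so that Spark can read it
write.csv(boys, "data.csv", row.names = FALSE)
# Reading the CSV into Spark and dropping the columns not used in this example
sdf <- spark_read_csv(sc, "data", "data.csv", header = TRUE, infer_schema = TRUE, null_value = "NA") %>%
  select(-all_of(c("hgt", "wgt", "bmi", "hc")))
# Preparing the inputs before running bigMICE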
variable_types <- c(age = "Continuous_float", 
                   gen = "Nominal", 
                   phb = "Nominal",
                   tv = "Continuous_int",
                   reg = "Nominal")

analysis_formula <- as.formula("phb ~ age + gen + tv + reg")
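
Before starting the imputation, it can be useful to confirm that the names in variable_types line up with the columns that remain in the data. The sketch below uses a hypothetical helper, check_variable_types, which is not part of bigMICE; it assumes that every column should have exactly one entry in variable_types, which this README implies but does not state.

```r
# Hypothetical helper (not part of bigMICE): compares column names with the
# names of the variable_types vector and reports mismatches in both directions.
check_variable_types <- function(column_names, variable_types) {
  list(
    untyped_columns = setdiff(column_names, names(variable_types)),
    unknown_names = setdiff(names(variable_types), column_names)
  )
}

variable_types <- c(age = "Continuous_float", gen = "Nominal", phb = "Nominal",
                    tv = "Continuous_int", reg = "Nominal")

# Columns of the boys data that remain after dropping hgt, wgt, bmi and hc
remaining_columns <- c("age", "gen", "phb", "tv", "reg")

check <- check_variable_types(remaining_columns, variable_types)
# Both elements are empty when names and columns match
length(check$untyped_columns) == 0 && length(check$unknown_names) == 0
```

With a live Spark session, colnames(sdf) should be usable in place of the hard-coded remaining_columns vector.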

Call the mice.spark function to obtain a single (m = 1) imputed dataset:

imputation_results <- bigMICE::mice.spark(
  data = sdf,
  sc = sc,
  variable_types = variable_types,
  analysis_formula = analysis_formula,
  predictorMatrix = NULL,
  m = 1,
  maxit = 2,
  checkpointing = FALSE
)

print(imputation_results)
