mhkc/bonsai-test-data

Bonsai Test Data

Overview

This repository contains a curated test dataset for Bonsai.

It is intended to support:

  • Local development environments
  • Integration and end-to-end testing
  • Demo instances
  • PRP‑driven sample uploads
  • Database seeding during bootstrap

This repo is not part of any single microservice. It serves as a shared, versioned source of truth for reproducible Bonsai test environments.

Docker Init Container

This repository also provides a Docker image that Bonsai environments use to mount test data.

Build locally

docker build -t bonsai-test-data:local .

Usage in Docker Compose (Dev/E2E)

init-test-data:
  image: ghcr.io/clinicalgenomicslund/bonsai-test-data:v0.1.0
  volumes:
    - testdata:/mnt/testdata
  command: ["sh", "-c", "cp -r /dataset/* /mnt/testdata"]

Reanalyze datasets

Updates to JASEN can require the test data to be reanalyzed. Here are the steps to redownload the data and recompute the results.

Note: See the JASEN docs for installation instructions and how to run it.

Download the datasets if needed.

./scripts/download_fastqs.sh -i bioprojects/PRJEB77209.illumina.tsv

Create a JASEN input file using the path to the downloaded fastq files.

./scripts/make_jasen_input.sh              \
    -i bioprojects/PRJEB77209.illumina.tsv \
    -f /path/to/fastq/                     \
    -o /output/dir/

Then run JASEN to produce the output files.

NOTE: If you run JASEN with the SMD convenience script start_nextflow_analysis, you have to add an assay column to the input file.

nextflow run main.nf                                      \
        -profile staphylococcus_aureus,illumina,apptainer \
        -config nextflow.config                           \
        --csv /output/dir/PRJEB77209.illumina.csv

Copy the files to the repo, either as a new pipeline version or by overwriting an existing result.

# if relevant change the version of jasen
jasen_version=1.2.0
resultPath=/fs1/results_dev/jasen/saureus
targetDir="/path/to/repo/results/v${jasen_version}/saureus"

# find all new result files
mkdir -p "${targetDir}"
tail -n +2 PRJEB77209.illumina.csv | awk -F',' '{print $1}' | while read -r id; do
  cd "$resultPath" || exit 1
  # dry run: echo prints each cp command; remove "echo" to copy the files
  find . -name "${id}*" -exec echo cp -R --parents {} "$targetDir" \;
done
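
Before copying, it can help to confirm which sample IDs the loop will search for. The following is a minimal sketch, not part of the repository scripts. It assumes, as the loop above does, that the CSV has a header row and that the sample ID is in the first comma-separated column. The function name list_sample_ids is hypothetical.

```shell
# Hypothetical helper: print the sample IDs from a JASEN input CSV.
# Assumes a header row and the sample ID in the first comma-separated column.
list_sample_ids() {
  csv_file=$1
  # Skip the header, keep column 1, and drop rows with an empty ID.
  tail -n +2 "$csv_file" | awk -F',' '$1 != "" {print $1}'
}
```

For example, `list_sample_ids PRJEB77209.illumina.csv` prints one ID per line. An empty ID would otherwise turn the find pattern into `*` and match every result file, which is why empty rows are dropped here.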

Finally, subset the large files to reduce the repo size and reindex the BAM files.

cd "${targetDir}"

find . -name '*.bam' -print0 | while IFS= read -r -d '' line; do
  # Create downsampled BAM
  samtools view -b -s 0.01 "$line" -o "${line}.mini" || continue

  # Replace original only if the new file was created successfully
  if [ -s "${line}.mini" ]; then
    rm -- "$line" && mv -- "${line}.mini" "$line"
  else
    echo "Warning: downsampled file is empty or missing for: $line" >&2
    rm -f -- "${line}.mini"
  fi
  samtools index "$line"
done

find . -name '*.fasta' -exec gzip {} \;
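
After the subsetting step, a quick sanity check can catch BAM files that were not reindexed and FASTA files that were not compressed. This is a sketch under stated assumptions, not part of the repository scripts. It assumes samtools index wrote its default output name (the BAM path with .bai appended). The function name check_subset_results is hypothetical.

```shell
# Hypothetical check: report BAM files without an index and FASTA files
# that are still uncompressed. Returns non-zero if any problem is found.
# Assumes the index sits next to the BAM as <file>.bam.bai.
check_subset_results() {
  dir=$1
  problems=$(
    # BAM files that lack a sibling .bai index
    find "$dir" -name '*.bam' -exec sh -c \
      '[ -e "$1.bai" ] || printf "missing index: %s\n" "$1"' _ {} \;
    # FASTA files that were not gzipped
    find "$dir" -name '*.fasta' -exec printf 'not compressed: %s\n' {} \;
  )
  if [ -n "$problems" ]; then
    printf '%s\n' "$problems" >&2
    return 1
  fi
}
```

Running `check_subset_results "${targetDir}"` prints nothing and returns 0 when every BAM file has an index and no plain .fasta files remain.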

About

Test data for bootstrapping a demo instance of Bonsai
