diff --git a/README.md b/README.md index f350a2b..4347126 100644 --- a/README.md +++ b/README.md @@ -1,272 +1,284 @@ - [![GitHub Downloads](https://img.shields.io/github/downloads/smithlabcode/abismal/total?style=social)](https://github.com/smithlabcode/abismal/releases/latest) [![Install with Conda](https://anaconda.org/bioconda/abismal/badges/platforms.svg)](https://anaconda.org/bioconda/abismal) [![Install with Conda](https://img.shields.io/conda/dn/bioconda/abismal?color=red&label=conda%20downloads&style=flat-square)](https://anaconda.org/bioconda/abismal) -## abismal ## +# Abismal -**A**nother **Bis**ulfite **M**apping **Al**gorithm (abismal) is -a read mapping program for bisulfite sequencing in DNA methylation -studies. +**A**nother **Bis**ulfite **M**apping **Al**gorithm (abismal) is a read +mapping program for bisulfite sequencing in DNA methylation studies. -Download the latest stable release -[here](https://github.com/smithlabcode/abismal/releases). +* Binaries for [macOS](https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-macOS.tar.gz) and [Linux](https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-Linux.tar.gz). +* Latest stable [release](https://github.com/smithlabcode/abismal/releases/latest). +* [Quickstart](#quickstart) +* [Examples](#examples) +* Full documentation [here](https://github.com/smithlabcode/abismal/blob/master/docs/MANUAL.md) (2025-06-13 needs an update) +* Building from [source](#building-from-source) -See how to [get started](https://github.com/smithlabcode/abismal/blob/master/docs/MANUAL.md#quick-installation) and the -[full program documentation](https://github.com/smithlabcode/abismal/blob/master/docs/MANUAL.md). +## Quickstart -### Requirements ### +* Download + [macOS](https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-macOS.tar.gz) + or + [Linux](https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-Linux.tar.gz) + binaries, install through [conda](https://anaconda.org/bioconda/abismal) or + [source](https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-3.3.0.tar.gz). -Currently abismal requires a C++ compiler. The default compiler -assumed is g++ (comes with GCC, available on your Linux or macOS -machine). HTSlib is also required. Instructions to get HTSlib, for -macOS or Linux systems, can be found below. +* Make an index for your reference genome (assuming human hg38): -If you have trouble with the `make` part of the installation procedure -described below, please contact us via e-mail or through a [GitHub -issue](https://github.com/smithlabcode/abismal/issues). + ```console + ./abismal idx hg38.fa hg38.idx + ``` -### Documentation ### +* Map reads using that index: -The full documentation for abismal can be found -[here](https://github.com/smithlabcode/abismal/blob/master/docs/MANUAL.md). This -explains the use of each parameter in full detail. Below, after -installation instructions, we describe the most common use cases: -indexing a genome and mapping single-end and paired-end reads. + ```console + ./abismal map -i hg38.idx -o reads.sam reads_1.fq reads_2.fq + ``` -### Installation on Linux ### +* See + [documentation](https://github.com/smithlabcode/abismal/blob/master/docs/MANUAL.md) + for mapping options and output formats. -These instructions are for building abismal from source, rather than -obtaining it through a package manager like conda. +## Examples -These instructions assume you have access to `apt` which is installed -on Ubuntu-based and Debian-based distributions. The only difference -for other linux distributions is how you get the dependencies. Likely -all you need is: -```console -$ sudo apt-get install -y libhts-dev -``` -If you don't have adminstrator privileges, there are other options. -If you have the `libhts-dev` installed, to build `abismal` the -following should work: -```console -$ wget https://github.com/smithlabcode/abismal/releases/download/v3.2.4/abismal-3.2.4.tar.gz -$ tar -zxvf abismal-3.2.4.tar.gz -$ cd abismal-3.2.4 -$ mkdir build && cd build -$ ../configure --prefix=/where/you/want/abismal -$ make -$ make install -``` -Be sure that you have permissions to write files to -`/where/you/want/abismal`. This will install `abismal`, `abismalidx` -and `simreads` inside the `bin` directory of the specified location. - -### Installation on macOS ### - -The GitHub repo for abismal includes tests that run on macOS 13 -(Ventura), and we use the following steps. Although our tests begin -with a "fresh" macOS installation, they have certain tools already -available. In particular, [Homebrew](https://brew.sh) is already -available and possibly some other tools. Homebrew is necessary as the -first step to get the tools and dependencies: -```console -$ brew update -$ brew install gcc -$ brew install htslib gsl -$ brew list --versions gcc -``` -At this point, keep the version of `gcc` in mind, because it will probably -be needed below. If you don't already have `abismal` downloaded, the next -step is to download it. Here we will assume you are using a release rather -than a clone. To build from a clone involves at least one more step. -```console -$ wget https://github.com/smithlabcode/abismal/releases/download/v3.2.4/abismal-3.2.4.tar.gz -$ tar -zxvf abismal-3.2.4.tar.gz -$ cd abismal-3.2.4 -``` -Finally, these steps build the software: -```console -$ mkdir build && build -$ ../configure \ - --prefix=/path/to/install \ - CXX="g++-13" \ - CPPFLAGS="-I$(brew --prefix)/include" \ - LDFLAGS="-L$(brew --prefix)/lib" -$ make -$ make install -``` -Notice the `g++-13` in the `../configure` command. This is the version -number referenced above. If you have a different version number (e.g., -when gcc-14 is the default), you will need to update that number to -correspond to the major version number. Be sure you have permissions -to write to the directory `/path/to/install`. - -### How to get the dependencies through conda ### - -If you are on linux and do not have adminstrator privileges to get the -dependencies (e.g., HTSlib), you can get them either by building them -directly from source, or through conda. In particular, for obtaining HTSlib -through conda, do the following: -``` -$ conda install -c bioconda htslib -``` -as explained here at [htslib](https://anaconda.org/bioconda/htslib). -I used conda obtained through miniconda3, which means the default -location for HTSlib to be installed is `~/miniconda3` and then inside -the `lib` and `include` subdirectores. So once this is done, you can -build `abismal` by replacing the `configure` step in the earlier -explanations by -```console -../configure --prefix=/path/to/install \ - CPPFLAGS="-I${HOME}/miniconda3/include" \ - LDFLAGS="-L${HOME}/miniconda3/lib" -``` -Note that you can use this approach with both Linux or macOS, but in -the case of macOS you can replace the `LDFLAGS` and `CPPFLAGS` for -conda, but keep the `CXX` variable. Remember not to use tilde (`~`) in -place of the `${HOME}` variable above. It might work, but shouldn't. - -### Installation from a clone of the repo ### - -This method is likely only useful if you need the most recent update, -and is not recommended for most users. The only difference from the -above explanations for linux and macos is that you will need to clone -the repo, which means you need `git` installed, and you will also need -to build the sources in place and without much reporting in case of any -problems. -```console -$ cd /where/you_want/the_code -$ git clone --recursive git@github.com:smithlabcode/abismal.git -$ cd abismal -$ make -$ make install -``` -If you are building from the source in a cloned repo, you will likely -see other ways to accomplish it by examining the files in the root of -the repo. +* Make an index from a reference genome in a single FASTA file (here, `hg38.fa`): -### Indexing the genome ### + ```console + ./abismal idx hg8.fa hg38.idx + ``` -The index can be constructed as follows, based on a genome existing -entirely in a single FASTA format file: -``` -$ abismalidx -``` +* Make an index using 8 threads: -### Bisulfite mapping ### + ```console + ./abismal idx -t 8 hg38.fa hg38.idx + ``` -single-end reads -``` -$ abismal [options] -i -o -``` -paired-end reads -``` -$ abismal [options] -i -o -``` +* Map single-end reads to hg38 using an index: -### abismal options ### - -|option|long version |arg type |default | description | -|:-----|:----------------|:--------|------------------:|:--------------------------------------| -| -i | -index | string | | genome index file | -| -g | -genome | string | | genome file (FASTA) | -| -o | -outfile | string | | output file (default SAM format) | -| -s | -stats | string | | mapping statistics output file (YAML) | -| -c | -max-candidates | integer | 100 | max candidates per seed* | -| -l | -min-frag | integer | 32 | minimum fragment length (PE mode) | -| -L | -max-frag | integer | 3000 | maximum fragment length (PE mode) | -| -m | -max-distance | double | 0.1 | max relative number of errors | -| -a | -ambig | boolean | | report a position for ambiguous reads | -| -P | -pbat | boolean | | input follows the PBAT protocol | -| -R | -random-pbat | boolean | | input follows the random PBAT protocol| -| -A | -a-rich | boolean | | reads are A-rich (SE mode) | -| -t | -threads | integer | 1 | number of mapping threads | -| -v | -verbose | boolean | | print more run info | -| -B | -bam | boolean | output SAM format | write output in BAM format | - -\* the max candidates parameter controls the amount of "effort" in -mapping. In the "sensitive" step, which aligns reads with smaller -exact match seeds, abismal skips seeds that retrieves more than `c` -candidates. The higher the value of `c`, the more alignments abismal -performs. Note that abismal still aligns reads to every exact match -hit that spans more than half of the read ("specific step"). The -specific step does not change with the value set by `c`. - -### Examples ### -(1) **Indexing the genome** - -To make an index for hg38: -``` -$ abismalidx hg38.fa hg38.abismalidx -``` -In the process of building the index, the names of chromosomes will be -truncated at the first whitespace character. + ```console + ./abismal map -i hg38.idx -o reads.sam reads.fq + ``` -(2) **Bisulfite Mapping** +* Map single-end reads with 64 cores (and get very close to 64x speedup): -To map single-end reads in file `reads.fq` to human genome hg38: -``` -$ abismal -i hg38.abismalidx -o reads.sam reads.fq -``` + ```console + ./abismal map -i hg38.idx -o reads.sam -t 64 reads.fq + ``` -To map paired-end reads in files `reads-1.fq` and `reads-2.fq` to human genome hg38: -``` -$ abismal -i hg38.abismalidx -o reads.sam reads-1.fq reads-2.fq -``` +* Map paired-end reads in `reads_1.fq` and `reads_2.fq`: -To map reads in BAM format: -``` -$ abismal -B -i hg38.abismalidx -o reads.bam reads.fq -``` + ```console + ./abismal map -i hg38.idx -o reads.sam reads_1.fq reads_2.fq + ``` -To map reads to human genome without requiring a separate index file -(i.e. run both indexing and mapping simultaneously): -``` -$ abismal -g hg38.fa -o reads.sam reads.fq +* Get output in BAM format: + + ```console + ./abismal map -i hg38.idx -B -o reads.bam reads.fq + ``` + +* Get mapping statistics in YAML format: + + ```console + ./abismal map -i hg38.idx -s reads.stats.yaml -o reads.sam reads.fq + ``` + +* Skip making the index (make it on the fly): + + ```console + ./abismal map -g hg38.fa -o reads.sam reads.fq + ``` + +* Map reads from PBAT data: + + ```console + ./abismal map -i hg38.idx -P -o reads.sam reads.fq + ``` + +* Map reads from random PBAT data: + + ```console + ./abismal map -i hg38.idx -R -o reads.sam reads.fq + ``` + +Mapping results are reported in SAM format. Some choices in the output are +explicitly highlighted below: + + * Reads are output identically to how they appear in the input FASTQ files, + regardless of mapped strand. + * the `NM` tag reports the edit distance between the read and the output, + specifically the sum of mismatches, insertions and deletions to the best + mapping position. + * The `CV` tag reports the assumed bisulfite base used to map the read. Reads + mapped as A-rich will be reported with `CV:A:A`, and reads mapped as T-rich + will be reported with `CV:A:T`. This tag is independent of the strand the + read was mapped to. If reads are not mapped in PBAT or random PBAT mode, + the first end will always be T-rich and the second end will always be + A-rich. + +## Building from source + +If you are here because the binaries don't work for you, please let us know +and we'll try to fix that. + +### Linux + +These instructions have been tested for Ubuntu 24.04 and Fedora 41. They will +likely work on most APT and RPM-based distributions in 2025. + +Here are the basic commands if you are ready to build: + +```console +wget https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-3.3.0.tar.gz +tar -xf abismal-3.3.0.tar.gz +cd abismal-3.3.0 +./configure --prefix=${HOME} +make +make install ``` -Mapping results are reported in SAM format. Some choices in the output -are explicitly highlighted below: - * Reads are output identically to how they were read, regardless of - mapped strand. - * the `NM` tag reports the edit distance between the read and the - output, specifically the sum of mismatches, insertions and - deletions to the best mapping position. - * The `CV` tag reports the assumed bisulfite base used to map the - read. Reads mapped as A-rich will be reported with `CV:A:A`, and - reads mapped as T-rich will be reported with `CV:A:T`. This tag is - independent of the strand the read was mapped to. If reads are not - mapped in PBAT or random PBAT mode, the first end will always be - T-rich and the second end will always be A-rich. - -### Contacts ### - -***Andrew D Smith*** *andrewds@usc.edu* -***Guilherme Sena*** *desenabr@usc.edu* - -### Citation ### +If you that doesn't work, then check that you have the right dependencies. You +can find them in the lists below, depending on your system. + +* Dependencies: + + Ubuntu/Debian + + ```console + apt-get update && \ + DEBIAN_FRONTEND=noninteractive \ + apt-get install -y --no-install-recommends \ + ca-certificates \ + g++ \ + make \ + zlib1g-dev \ + libhts-dev \ + automake \ + git \ + wget + ``` + + Fedora/Red Hat + + ```console + dnf update -y && \ + dnf install -y \ + g++ \ + make \ + zlib-devel \ + libhts-devel \ + wget \ + automake \ + awk \ + git + ``` + + The wget is only needed if for the next step, and the automake and git (and + awk for Fedora/Red Hat) are only needed for the subsequent step using a + clone. Your machine likely has these already. + + If you don't have admin privileges on your system, you can use Conda to get + all the dependencies. Assuming you don't already have conda installed, this + will get everything you need: + + ```console + wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh && \ + sh Miniforge3-Linux-x86_64.sh -bsup ${HOME}/miniforge3 && \ + export PATH=${HOME}/miniforge3/bin:$PATH && \ + conda install -y \ + conda-forge::binutils \ + conda-forge::gxx \ + conda-forge::zlib \ + conda-forge::make \ + conda-forge::automake \ + conda-forge::git \ + bioconda::htslib + ``` + +* Build from a source release and install in your home directory: + + ```console + wget https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-3.3.0.tar.gz + tar -xf abismal-3.3.0.tar.gz + cd abismal-3.3.0 + ./configure --prefix=${HOME} + make + make install + ``` + +* Build from a clone and install in your home directory: + + ```console + git clone --recursive https://github.com/smithlabcode/abismal.git + cd abismal + ./autogen.sh + ./configure --prefix=${HOME} + make + make install + ``` + +### macOS + +* Dependencies: + + Get a compiler through [xcode](https://developer.apple.com/xcode) or get gcc + from Homebrew by adding `gcc` to the list below. + + ```console + brew update && \ + brew install \ + zlib \ + htslib \ + automake \ + git \ + wget + ``` + + The wget is only needed if for the next step, and the automake and git are + only needed for the subsequent step using a clone. I tested these steps on + GitHub runners for macOS-15 so you might need additional dependencies. + + Conda as explained above will also work, but you need the conda installer + script for your macOS system. + +* Build from a source release and install in your home directory: + + ```console + wget https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-3.3.0.tar.gz + tar -xf abismal-3.3.0.tar.gz + cd abismal-3.3.0 + ./configure CPPFLAGS="-I$(brew --prefix)/include" LDFLAGS="-L$(brew --prefix)/lib" --prefix=${HOME} + make + make install + ``` + +* Build from a clone and install in your home directory: + + ```console + git clone --recursive https://github.com/smithlabcode/abismal.git + cd abismal + ./autogen.sh + ./configure CPPFLAGS="-I$(brew --prefix)/include" LDFLAGS="-L$(brew --prefix)/lib" --prefix=${HOME} + make + make install + ``` + +If you build from source you can enable a mode for mapping very short reads. +Using such reads is discouraged and rarely helpful. Use the `--help` argument +to the configure script to see how to enable this option. + +## Contacts + +Andrew D Smith andrewds@usc.edu + +## Citation + The abismal manuscript is available -[here](https://doi.org/10.1093/nargab/lqab115). -If you used `abismal` to analyze your data, please cite us as follows. +[here](https://doi.org/10.1093/nargab/lqab115). If you used `abismal` to +analyze your data, please cite: + ``` de Sena Brandine, G., & Smith, A. D. (2021). Fast and memory-efficient mapping of short bisulfite sequencing reads using a two-letter alphabet. NAR Genomics and Bioinformatics, 3(4), lqab115. ``` - -### Copyright ### - -Copyright (C) 2018-2023 Andrew D. Smith and Guilherme de Sena Brandine - -Authors: Andrew D. Smith and Guilherme de Sena Brandine - -abismal is free software: you can redistribute it and/or modify it under -the terms of the GNU General Public License as published by the Free -Software Foundation, either version 3 of the License, or (at your -option) any later version. - -abismal is distributed in the hope that it will be useful, but WITHOUT -ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or -FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License -for more details. diff --git a/configure.ac b/configure.ac index 0fcd159..28614e0 100644 --- a/configure.ac +++ b/configure.ac @@ -14,7 +14,7 @@ dnl but WITHOUT ANY WARRANTY; without even the implied warranty of dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU dnl General Public License for more details. -AC_INIT([abismal], [3.2.4], [andrewds@usc.edu], +AC_INIT([abismal], [3.3.0], [andrewds@usc.edu], [abismal], [https://github.com/smithlabcode/abismal]) dnl the config.h is not currently #included in the source, and only diff --git a/data/md5sum.txt b/data/md5sum.txt index 55ddc4c..4c168d3 100644 --- a/data/md5sum.txt +++ b/data/md5sum.txt @@ -10,7 +10,7 @@ e94e71292fccd255bad3f3694efe7a4b tests/reads_rpbat_pe_1.fq 8bfa00acd0639dc66acd5aa8ac0369d5 tests/reads_rpbat_pe_2.fq 3a0a49feb314f6cb8fa49c4c114b52fc tests/reads_rpbat_pe.mstats bcbf01be810cbf4051292813eb6b9225 tests/tRex1.idx -8e6ba9c3b111ab11453a99e83ad633db tests/reads_pbat_pe.sam -a7e5e4cbfed4491ad0bb6380416e4f13 tests/reads_pe.sam -817f180322f24dc1b542ef48e67ca38a tests/reads_rpbat_pe.sam -ccc7719b329d68fc4527361cd14849ee tests/reads.sam +ffff35a5110b3312555fc3ccaa78c99a tests/reads_pbat_pe.sam +af094fdc32b41a1605bb6d5b71e1b767 tests/reads_pe.sam +8370999d6ad6d5afc99ec483f16cc214 tests/reads_rpbat_pe.sam +a6d0c51aa69795b9bf3f9e424af35978 tests/reads.sam diff --git a/docs/MANUAL.md b/docs/MANUAL.md index 61e1c78..764e89d 100644 --- a/docs/MANUAL.md +++ b/docs/MANUAL.md @@ -1,83 +1,97 @@ --- title: "abismal reference manual" -author: "Guilherme Sena" +author: "Andrew D Smith and Guilherme Sena" --- # NAME -abismal - a fast and memory-efficient mapper for short whole genome -bisulfite sequencing reads. Official release, discussion, source code -and development versions are available at the abismal GitHub +abismal - a fast and memory-efficient mapper for short whole genome bisulfite +sequencing reads. Official release, discussion, source code and development +versions are available at the abismal GitHub [repo](https://github.com/smithlabcode/abismal). # SYNOPSIS -abismal [OPTIONS] -o output.sam [] +abismal map [OPTIONS] -o output.sam [] + +abismal idx [OPTIONS] + +# Examples -## Examples ``` -abismal -g ref.fa -o out.sam [-s out.stats] reads-1.fq -abismal -g ref.fa -o out.sam [-s out.stats] reads-1.fq reads-2.fq +abismal map -g ref.fa -o out.sam [-s out.stats] reads_1.fq +abismal map -g ref.fa -o out.sam [-s out.stats] reads_1.fq reads_2.fq ``` -If a ref.idx file was created using abismalidx: + +If a ref.idx file was created using `abismal idx`, which makes mapping faster: + ``` -abismal -i ref.idx -o out.sam [-s out.stats] reads-1.fq -abismal -i ref.idx -o out.sam [-s out.stats] reads-1.fq reads-2.fq +abismal map -i ref.idx -o out.sam [-s out.stats] reads_1.fq +abismal map -i ref.idx -o out.sam [-s out.stats] reads_1.fq reads_2.fq +``` + +To make an index from a FASTA format reference genome named `ref.fa` output to +the file named `ref.idx`: + +``` +abismal idx ref.fa ref.idx ``` # DESCRIPTION -abismal maps short bisulfite sequencing reads in FASTQ format to a -reference genome in FASTA format and produces an output file in SAM -format. Only reads that are mapped uniquely or ambiguously (optional) -are reported in the SAM output. +abismal maps short bisulfite sequencing reads in FASTQ format to a reference +genome in FASTA format and produces an output file in SAM format. Only reads +that are mapped uniquely or ambiguously (optional) are reported in the SAM +output. -abismal can map reads assuming either C>T or G>A conversion in -single-end reads. Both conversions are also accepted in paired-end -mapping, but corresponding read pairs are assume to have complementary -base conversions. The assumption of C>T conversion means Ts in the -read are considered matches to either Cs or Ts in the reference (in -either strand). The assumption of G>A conversion means that As in -reads are considered matches to either Gs or As in the reference. +abismal can map reads assuming either C>T or G>A conversion in single-end +reads. Both conversions are also accepted in paired-end mapping, but +corresponding read pairs are assume to have complementary base +conversions. The assumption of C>T conversion means Ts in the read are +considered matches to either Cs or Ts in the reference (in either strand). +The assumption of G>A conversion means that As in reads are considered matches +to either Gs or As in the reference. absimal was built to map short reads of up to 250 bases. It should -successfully map reads of size up to 1 million, but because it uses -very short seeds for filtration, the mapping time will increase -substantially. +successfully map reads of size up to 1 million, but because it uses very short +seeds for filtration, the mapping time will increase substantially. # QUICK INSTALLATION Run the following commands to install abismal + ``` -wget https://github.com/smithlabcode/abismal/releases/download/v3.2.4/abismal-3.2.4.tar.gz -tar -xvzf abismal-3.2.4.tar.gz -cd abismal-3.2.4 +wget https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-3.3.0.tar.gz +tar -xf abismal-3.3.0.tar.gz +cd abismal-3.3.0 ./configure --prefix=$(pwd) make make install ``` -This will create a bin directory where binaries abismalidx and abismal -are located. The abismalidx program indexes a FASTA reference genome, -and abismal maps FASTQ reads to a reference. + +These commands assume dependencies are available. The `README.md` file in the +GitHub abismal repo explains this in more detail. The above will create a bin +directory where the abismal binary is located. The abismalidx program has +been replaced by the `abismal idx` command. This indexes a FASTA reference +genome, and `abismal map` maps FASTQ reads to a reference. # OPTIONS -i FILE, -index FILE [required if -g not provided] -Input index file generated by abismalidx. Either the -g or -i -parameter must be provided (but not both). If -g is provided, abismal -will read the indexed genome before starting to map reads. Using -i is -recommended if several FASTQ files are mapped to the same genome. +Input index file generated by `abismal idx`. Either the -g or -i parameter +must be provided (but not both). If -g is provided, abismal will read the +indexed genome before starting to map reads. Using -i is recommended if +several FASTQ files are mapped to the same genome. -g FILE, -genome FILE [required if -i not provided] -Input FASTA genome. Either the -g or -i parameter must be provided -(but not both). If -g is provided, abismal will first index the input -FASTA genome before starting to map reads. Using -g is recommended if -only one sample will be mapped to a reference genome. In most -practical cases, however, it is preferable to first index the genome -using abismalidx, then using the output index file as input through -the -i flag. +Input FASTA genome. Either the -g or -i parameter must be provided (but not +both). If -g is provided, abismal will first index the input FASTA genome +before starting to map reads. Using -g is recommended if only one sample will +be mapped to a reference genome. In most practical cases, however, it is +preferable to first index the genome using `abismal idx`, then using the +output index file as input through the -i flag. -o FILE, -outfile FILE @@ -89,15 +103,16 @@ Using this argument, the output will be in BAM format. -s FILE, -stats FILE -Output mapping statistics file in YAML format. This file provides a -summary of mapping efficiency, detailing how many reads were zero, one -or multiple times. It also provides the error rate of mapped reads and -the number of reads that were too short to be mapped +Output mapping statistics file in YAML format. This file provides a summary of +mapping efficiency, detailing how many reads were zero, one or multiple +times. It also provides the error rate of mapped reads and the number of reads +that were too short to be mapped -The output is in YAML format, which is human-readable and can be -parsed in several programming languages. +The output is in YAML format, which is human-readable and can be parsed in +several programming languages. For single-end reads the statistics looks something like this: + ``` total_reads: 1000000 mapped: @@ -116,7 +131,9 @@ num_skipped: 0 percent_unmapped: 4.538 percent_skipped: 0 ``` + For paired-end reads it looks like this: + ``` pairs: total_pairs: 1000000 @@ -173,21 +190,19 @@ mate2: -t NUM-THREADS, -threads NUM-THREADS [default : 1] -number of threads that should be used to map reads. Each thread -reads 1000 reads in parallel, so increasing this number uses more -memory by a few kilobytes per additional thread. In most practical -cases this should not be significantly different than single-thread -mapping. +number of threads that should be used to map reads. Each thread reads 1000 +reads in parallel, so increasing this number uses more memory by a few +kilobytes per additional thread. In most practical cases this should not be +significantly different than single-thread mapping. -l MIN-FRAG-VALUE, -min-frag MIN-FRAG-VALUE [default : 32] -**For paired-end mode only**. The minimum size a concordant fragment -can have. There are cases in which concordant pairs can "dovetail", -that is, the end of the reverse mate can pass the start of the -forward mate. Ths parameter dictates the minimum size a dovetail -read can have while accepting concordance. The schematic below depicts -what the value of -l represents: +(paired-end only) The minimum size a concordant fragment can +have. There are cases in which concordant pairs can "dovetail", that is, the +end of the reverse mate can pass the start of the forward mate. Ths parameter +dictates the minimum size a dovetail read can have while accepting +concordance. The schematic below depicts what the value of -l represents: ``` [==================================>] [FORWARD MATE] @@ -198,11 +213,11 @@ what the value of -l represents: -L MAX-FRAG-VALUE, -max-frag MAX-FRAG-VALUE [default : 3000] -**For paired-end mode only**. The maximum size a concordant -fragment, defined as the maximum distance between the genome start -(smallest) coordinate of the forward mapped mate and the start -(largest) coordinate of the reverse mapped strand, for a pair to be -considered concordant. The schematic below depics how L is calculated. +(paired-end only) The maximum size a concordant fragment, defined +as the maximum distance between the genome start (smallest) coordinate of the +forward mapped mate and the start (largest) coordinate of the reverse mapped +strand, for a pair to be considered concordant. The schematic below depics how +L is calculated. ``` [===============>] [FORWARD MATE] @@ -212,50 +227,47 @@ considered concordant. The schematic below depics how L is calculated. -m MAX-FRACTION, -max-distance MAX-FRACTION [default : 0.1] -The maximum edit distance allowed for a read to be considered -"mapped", relative to the read's mapped length. Abismal will choose -the location in the genome with maximum alignment score. This location -will be reported if the number of mismatches, insertions and deletions -is at most -m times the mapped region of the read (i.e. excluding -soft-clipped bases). For instance, if a read of 150 bp is mapped to a -location with CIGAR string 20S100M30S, the read is allowed to have -at most 10 mismatches. +The maximum edit distance allowed for a read to be considered "mapped", +relative to the read's mapped length. Abismal will choose the location in the +genome with maximum alignment score. This location will be reported if the +number of mismatches, insertions and deletions is at most -m times the mapped +region of the read (i.e. excluding soft-clipped bases). For instance, if a +read of 150 bp is mapped to a location with CIGAR string 20S100M30S, the read +is allowed to have at most 10 mismatches. -a -ambig -If this flag is provided, abismal will report a random location to -reads that mapped ambiguously. These reads can be identified by the -presence of bit 0x100 in the SAM flags. +If this flag is provided, abismal will report a random location to reads that +mapped ambiguously. These reads can be identified by the presence of bit 0x100 +in the SAM flags. -P -pbat -**For paired-end mapping only**. Assumes the bisulfite conversion -of the first end to be G>A and the bisulfite conversion of the -second end to be C>T. +(paired-end only) Assumes the bisulfite conversion of the first +end to be G>A and the bisulfite conversion of the second end to be C>T. -R -random-pbat Maps reads in random PBAT mode. For single-end mapping, abismal will attempt to map reads assuming C>T -conversion, then G>A, and keeping the conversion that attains the -best alignment score. +conversion, then G>A, and keeping the conversion that attains the best +alignment score. -For paired-end mapping, abismal will attempt to map reads assuming end -1 has C>T conversion and end 2 has G>A conversion. Then it will -map the same reads assuming end 1 has G>A conversion. The conversion -that attains highest sum of alignment scores is kept. +For paired-end mapping, abismal will attempt to map reads assuming end 1 has +C>T conversion and end 2 has G>A conversion. Then it will map the same reads +assuming end 1 has G>A conversion. The conversion that attains highest sum of +alignment scores is kept. -A -a-rich -**For single-end mapping only**. Indicates that reads are A-rich. -Mapping with this flags assumes that the bisulfite conversion is G>A -instead of C>T. +(single-end only) Indicates that reads are A-rich. Mapping +with this flags assumes that the bisulfite conversion is G>A instead of C>T. -v -verbose -Prints more run info on the mapping progress, including a progress -bar showing the percentage of input reads currently processed. +Prints more run info on the mapping progress, including a progress bar showing +the percentage of input reads currently processed. # INPUT FASTQ FORMAT @@ -271,6 +283,7 @@ FASTQ format represents reads through 4 lines per read. the abismal algorithm. An example FASTQ file with one read looks like this: + ``` @1_chr3:131015484-131015553_R1 TTTATTAGGTAAGAAGGATAATAAGGGAGTTGAGTTTATGTGTTATAGAG @@ -278,131 +291,126 @@ TTTATTAGGTAAGAAGGATAATAAGGGAGTTGAGTTTATGTGTTATAGAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII ``` -Because reads take 4 lines per entry, the number of lines of the FASTQ -input must be a multiple of 4. This is a necessary but not sufficient -condition for properly formatted FASTQ. +Because reads take 4 lines per entry, the number of lines of the FASTQ input +must be a multiple of 4. This is a necessary but not sufficient condition for +properly formatted FASTQ. -For paired-end data, it is mandatory that both FASTQ ends have the -same number of lines. Corresponding entries in each file are assumed -to be mates. +For paired-end data, it is mandatory that both FASTQ ends have the same number +of lines. Corresponding entries in each file are assumed to be mates. # OUTPUT SAM FORMAT ## Output headers -abismal representes mapped reads in the Sequence Alignment/Map -(SAM) format. Detailed specification of SAM is available at -[the samtools documentation page](https://samtools.github.io). -BAM format is also supported, but the BAM format can be converted -to SAM using samtools if desired. +abismal representes mapped reads in the Sequence Alignment/Map (SAM) +format. Detailed specification of SAM is available at the +[samtools](https://samtools.github.io) documentation page. BAM format is also +supported. -The output starts with SAM headers. Headers contain metadata -information on the reference genome. Each header line starts with the -@ character. +The output starts with SAM headers. Headers contain metadata information on +the reference genome. Each header line starts with the @ character. The first line in the SAM output file is given by + ``` @HD VN:1.0 ``` -followed by one line per chromosome in the input FASTA file, encoding -the chromosome length. Each line is in the format + +followed by one line per chromosome in the input FASTA file, encoding the +chromosome length. Each line is in the format + ``` @SQ SN:[chrom-name] LN:[chrom-length] ``` -where [chrom-name] is given by the first word of the chromosome name -in the FASTA file (anything after the first white space is deleted), -and [chrom-length] is the number of bases in the chromosome -sequence. -The last line of the headers is a copy of how the program was called -to generate the SAM output, and is of the form +where [chrom-name] is given by the first word of the chromosome name in the +FASTA file (anything after the first white space is deleted), and +[chrom-length] is the number of bases in the chromosome sequence. + +The last line of the headers is a copy of how the program was called to +generate the SAM output, and is of the form + ``` -@PG ID:ABISMAL VN:3.2.4 CL:[command-call] +@PG ID:ABISMAL VN:3.3.0 CL:[command-call] ``` + where [command-call] is the shell command used to run abismal. ## Output mapped lines -Following the header lines, reads that are mapped once (or at least -once if the -a flag is used) are reported. Each read is a set of -thirteen tab-separated fields as follows. +Following the header lines, reads that are mapped once (or at least once if +the -a flag is used) are reported. Each read is a set of thirteen +tab-separated fields as follows. The first eleven fields are mandatory SAM fields: - - QNAME: The query name, given by the first word of the FASTQ - read name + - QNAME: The query name, given by the first word of the FASTQ read name - FLAG: List of samtools flags, according to the SAM file format documentation - - RNAME: The reference name, or the chromosome to which the read was - mapped - - POS: 1-based position in the chromosome in which the read was - mapepd - - MAPQ: Abismal does not provide MAPQ, so this value is always set to - 255 ("not defined" according to the SAM documentation) + - RNAME: The reference name, or the chromosome to which the read was mapped + - POS: 1-based position in the chromosome in which the read was mapepd + - MAPQ: Abismal does not provide MAPQ, so this value is always set to 255 + ("not defined" according to the SAM documentation) - CIGAR: A CIGAR string representing the read alignment. M stands for - matches, S are soft-clipped bases, I are insertions to the - reference and D are deletions from the reference - - RNEXT: This field is an equal sign (=) for reads that mapped - concordantly or an asterisk (*) for single-end reads or reads that - map discordantly. - - PNEXT: The position (POS field) of the read's mate. In single-end - reads this is given by an asterisk (*) - - TLEN: The fragment length for paired-end reads. It is a positive - value for if the first end maps to the forward strand and negative - otherwise. For discordant or single-end reads a value of 0 is - reported. - - SEQ: The input sequence identical to the FASTQ input (see - "important note on SEQ reads reported in the SAM output" below) - - QUAL: An asterisk (*). QUAL values are discarded on mapping and - not reported + matches, S are soft-clipped bases, I are insertions to the reference and D + are deletions from the reference + - RNEXT: This field is an equal sign (=) for reads that mapped concordantly + or an asterisk (*) for single-end reads or reads that map discordantly. + - PNEXT: The position (POS field) of the read's mate. In single-end reads + this is given by an asterisk (*) + - TLEN: The fragment length for paired-end reads. It is a positive value for + if the first end maps to the forward strand and negative otherwise. For + discordant or single-end reads a value of 0 is reported. + - SEQ: The input sequence identical to the FASTQ input (see "important note + on SEQ reads reported in the SAM output" below) + - QUAL: An asterisk (*). QUAL values are discarded on mapping and not + reported The last two fields are optional tags that can be used downstream - NM: The edit distance (mismatches + insertions + deletions) in the alignment between the read and reference - - CV: The bisulfite letter conversion assumed when mapping the read. - If the read was assumed A-rich (G>A conversion), the value - CV:A:A is reported. If the read was assumed T-rich (C>T - conversion), then the value CV:A:T is reported. If the -R flag is - not set, all reads coming from the same FASTQ file will have the same - assumed conversion. If the -R flag is used, this tag provides the - conversion used in the reported alignment. + - CV: The bisulfite letter conversion assumed when mapping the read. If the + read was assumed A-rich (G>A conversion), the value CV:A:A is reported. If + the read was assumed T-rich (C>T conversion), then the value CV:A:T is + reported. If the -R flag is not set, all reads coming from the same FASTQ + file will have the same assumed conversion. If the -R flag is used, this + tag provides the conversion used in the reported alignment. ## IMPORTANT NOTE ON SEQ READS REPORTED IN THE SAM OUTPUT -In a strict sense, the SEQ field reported by abismal does not comply -with the SAM standard. Properly formatted SAM files reverse-complement -reads that map to the reverse strand of the reference genome, whereas -abismal reports reads identical to the input FASTQ. This is done -deliberately to maintain consistency with the conversion type reported -in the CV tag. This standard allows downstream analyses on letter -frequencies and conversion types. Tools to reverse-complement the SEQ -field and the letter in the CV tag can be used if properly formatted -SAM files are necessary. +In a strict sense, the SEQ field reported by abismal does not comply with the +SAM standard. Properly formatted SAM files reverse-complement reads that map +to the reverse strand of the reference genome, whereas abismal reports reads +identical to the input FASTQ. This is done deliberately to maintain +consistency with the conversion type reported in the CV tag. This standard +allows downstream analyses on letter frequencies and conversion types. Tools +to reverse-complement the SEQ field and the letter in the CV tag can be used +if properly formatted SAM files are necessary. ## Filtering concordant/discordant pairs in paired-end SAM -In paired-end mapping, abismal will try to mate concordant pairs and -find pairs whose sum of edit distances is below the user-defined -cut-off (set to 10\% of the mapped read length by default). If it -cannot find such concordant pair (either due to ambiguity or low -alignment score), abismal will map each end as single-end reads and -report locations with sufficiently high similarity. In many cases, one -may want to only keep concordant pairs, or even isolate discordant -pairs to analyze their sequence patterns, mapping quality, etc. +In paired-end mapping, abismal will try to mate concordant pairs and find +pairs whose sum of edit distances is below the user-defined cut-off (set to +10\% of the mapped read length by default). If it cannot find such concordant +pair (either due to ambiguity or low alignment score), abismal will map each +end as single-end reads and report locations with sufficiently high +similarity. In many cases, one may want to only keep concordant pairs, or even +isolate discordant pairs to analyze their sequence patterns, mapping quality, +etc. To isolate concordant pairs from paired-end output `out.sam`, use the -following awk script, which isolates the SAM header and reads with an -equal (=) symbol in the RNEXT field (reserved only for concordant -pairs). +following awk script, which isolates the SAM header and reads with an equal +(=) symbol in the RNEXT field (reserved only for concordant pairs). + ``` awk '$1 ~ /^@/ || $7 == "="' out.sam >out_paired.sam ``` -To isolate discordantly mapped reads from paired-end output `out.sam`, -use the following awk script, which isolates the SAM header and reads -with an asterisk (*) in the RNEXT field (reserved for discordantly -mapped reads). +To isolate discordantly mapped reads from paired-end output `out.sam`, use the +following awk script, which isolates the SAM header and reads with an asterisk +(*) in the RNEXT field (reserved for discordantly mapped reads). + ``` awk '$1 ~ /^@/ || $7 == "*"' out.sam >out_single.sam ``` @@ -414,72 +422,65 @@ options. ## T-rich and A-rich reads -Abismal uses the notation of T-rich and A-rich reads for input -reads. We say a read is T-rich if it was sequenced directly after -bisulfite conversion, and we say a read is A-rich if it the complement -of the bisulfite-converted DNA was sequenced. Single-end and end 1 of -paired-end reads are usually T-rich, and end 2 of paired-end reads are -usually A-rich. - -For most organisms, we can infer if an input is T-rich based on the frequencies -of Ts and Cs (for animal samples and most plants). Half of the T-rich bases -should be Ts, and the frequency of Cs should be extremely low. Conversely, in -A-rich samples half of the bases are As, and the frequency of Gs is very low. If -neither case applies for the dataset, it may be either an RPBAT sample or not a -bisulfite sequencing sample (see "notation used by abismal" below for a -description of what RPBAT samples are) +Abismal uses the notation of T-rich and A-rich reads for input reads. We say a +read is T-rich if it was sequenced directly after bisulfite conversion, and we +say a read is A-rich if it the complement of the bisulfite-converted DNA was +sequenced. Single-end and end 1 of paired-end reads are usually T-rich, and +end 2 of paired-end reads are usually A-rich. + +For most organisms, we can infer if an input is T-rich based on the +frequencies of Ts and Cs (for animal samples and most plants). Half of the +T-rich bases should be Ts, and the frequency of Cs should be extremely +low. Conversely, in A-rich samples half of the bases are As, and the frequency +of Gs is very low. If neither case applies for the dataset, it may be either +an RPBAT sample or not a bisulfite sequencing sample (see "notation used by +abismal" below for a description of what RPBAT samples are) ## Traditional, PBAT and RPBAT reads Reads that follow the expected T-rich/A-rich convention are called -traditional. Inputs are assumed traditional by default. In other -words, if no flags are provided, A single-end input, as well as end 1 -of paired-end reads, will be assumed T-rich, and end 2 will be assumed -A-rich. - -We say reads are PBAT (often from the Post-Bisulfite Adapter -Tagging protocol) if they follow the opposite convention of -traditional reads. This means that a PBAT single-end input is A-rich, -and a PBAT paired-end input is A-rich in end 1 and T-rich in end 2. -Reads can be mapped in PBAT mode by using the -p flag or the -pbat -flag among the [options]. - -We say reads are RPBAT (from the Random-priming PBAT protocol) if -reads in the input can be T-rich or A-rich with equal probability, and -both must be tried. For paired-end input, we always assume that -corresponding ends of reads have complementary bisulfite bases. In -other words, if end 1 is T-rich, then end 2 is A-rich, and vice-versa. -Reads can be mapped as RPBAT by passing the -R flag or the --random-pbat flag among the [options]. +traditional. Inputs are assumed traditional by default. In other words, if no +flags are provided, A single-end input, as well as end 1 of paired-end reads, +will be assumed T-rich, and end 2 will be assumed A-rich. + +We say reads are PBAT (often from the Post-Bisulfite Adapter Tagging protocol) +if they follow the opposite convention of traditional reads. This means that a +PBAT single-end input is A-rich, and a PBAT paired-end input is A-rich in end +1 and T-rich in end 2. Reads can be mapped in PBAT mode by using the -p flag +or the -pbat flag among the [options]. + +We say reads are RPBAT (from the Random-priming PBAT protocol) if reads in the +input can be T-rich or A-rich with equal probability, and both must be +tried. For paired-end input, we always assume that corresponding ends of reads +have complementary bisulfite bases. In other words, if end 1 is T-rich, then +end 2 is A-rich, and vice-versa. Reads can be mapped as RPBAT by passing the +-R flag or the -random-pbat flag among the [options]. ### What if the protocol (traditional, PBAT, RPBAT) is not known? If the sequencing protocol is not known, we suggest trying all 3 -possibilities. You will always get most reads by mapping as RPBAT, but -that does not mean this is always the best option. For instance, if -reads come from the traditional single-end protocol, the reads mapped -as A-rich may be false positives and lead to incorrect downstream -analyses. You should only map as RPBAT if you are getting very low -mapping values in both traditional and PBAT modes, which suggests that -reads are truly RPBAT. +possibilities. You will always get most reads by mapping as RPBAT, but that +does not mean this is always the best option. For instance, if reads come from +the traditional single-end protocol, the reads mapped as A-rich may be false +positives and lead to incorrect downstream analyses. You should only map as +RPBAT if you are getting very low mapping values in both traditional and PBAT +modes, which suggests that reads are truly RPBAT. ## Valid hits -Abismal maps reads by first computing Hamming distances between -the read and the candidates retrieved during seeding. Hamming distance -is the number of mismatches between the read and the candidate, so no -insertions and deletions are made. We say a candidate is a valid hit -for a read if the Hamming distance is below 40% of the read length. -Hits that are not valid will not be aligned, as they are extremely -unlikely to yield high alignment scores. +Abismal maps reads by first computing Hamming distances between the read and +the candidates retrieved during seeding. Hamming distance is the number of +mismatches between the read and the candidate, so no insertions and deletions +are made. We say a candidate is a valid hit for a read if the Hamming distance +is below 40% of the read length. Hits that are not valid will not be aligned, +as they are extremely unlikely to yield high alignment scores. ## Alignment scores and edit distances -Abismal aligns reads through a banded Smith-Waterman alignment, using -a band width of 3. We use a scoring system of +1 for matches, -1 for -mismatches and -1 for indels. In other words, if an alignment has M -matches, m mismatches, I insertions and D deletions, the alignment -score A is given by +Abismal aligns reads through a banded Smith-Waterman alignment, using a band +width of 3. We use a scoring system of +1 for matches, -1 for mismatches and +-1 for indels. In other words, if an alignment has M matches, m mismatches, I +insertions and D deletions, the alignment score A is given by $$A = a - I - D$$ @@ -487,42 +488,40 @@ and the edit distance E is given by $$E = m + I + D$$ -Abismal selects the best alignment score, and only reports it if the -edit distance is below a fraction m (set under the -m or --max-fraction flag) of the read length. For example, if the read -length is 100 and m =0.1, the read is only reported if the edit -distance is at most 10. +Abismal selects the best alignment score, and only reports it if the edit +distance is below a fraction m (set under the -m or -max-fraction flag) of the +read length. For example, if the read length is 100 and m =0.1, the read is +only reported if the edit distance is at most 10. -We also note that abismal does not use phred quality scores in -alignment, so a mismatch on a low quality base is the same as a -mismatch in a high-quality base. +We also note that abismal does not use phred quality scores in alignment, so a +mismatch on a low quality base is the same as a mismatch in a high-quality +base. ## Valid alignments -We say a candidate is a valid alignment if the edit distance is at -most a fraction m of the read length. The acceptable fraction can be -set through the -m or -max-distance flag in [options]. Some -application-specific cases may require more or less acceptable error, -so this parameter can be adjusted by the user. +We say a candidate is a valid alignment if the edit distance is at most a +fraction m of the read length. The acceptable fraction can be set through the +-m or -max-distance flag in [options]. Some application-specific cases may +require more or less acceptable error, so this parameter can be adjusted by +the user. ## Concordant pairs -On paired-end input, we say reads $r_1$ and $r_2$ with lengths -$n_1$ and $n_2$, respectively, are a corresponding pair are -concordant if there exist positions $p_1$ and $p_2$ such that +On paired-end input, we say reads $r_1$ and $r_2$ with lengths $n_1$ and +$n_2$, respectively, are a corresponding pair are concordant if there exist +positions $p_1$ and $p_2$ such that 1. $p_1$ is a valid alignment for $r_1$ and $p_2$ is a valid alignment for $r_2$ 2. $p_1$ and $p_2$ are in opposite strands of the genome, and 3. $p_2 - p_1 \geq l$ and $p_2 - p_1 \leq L - n2$. -This means that the fragment lengths from which the pair originates is -at least $l$ and at most $L$. The default values of $l$ and $L$ are 32 -and 3000, respectively, and can be set by the `-l` and `-L` flags. -Those are conservative values that cover most of the current -protocols. We will incorporate automatic calculations of these values -in the future based on the first high-quality read pairs that are -mapped. +This means that the fragment lengths from which the pair originates is at +least $l$ and at most $L$. The default values of $l$ and $L$ are 32 and 3000, +respectively, and can be set by the `-l` and `-L` flags. Those are +conservative values that cover most of the current protocols. We will +incorporate automatic calculations of these values in the future based on the +first high-quality read pairs that are mapped. ## Discordant pairs @@ -535,18 +534,18 @@ simultaneously ## Reporting non-concordant reads -Read pairs that are not concordant (including discordant reads) do not -follow the expectations, but in cases where they map with high -quality, they will be reported. If read pairs are not concordant, -abismal maps each end independently as single-end reads. If users do -not which to keep single-end alignments in paired-end data, they can -filter out single-end reads of an output file out.sam -though the following command. +Read pairs that are not concordant (including discordant reads) do not follow +the expectations, but in cases where they map with high quality, they will be +reported. If read pairs are not concordant, abismal maps each end +independently as single-end reads. If users do not which to keep single-end +alignments in paired-end data, they can filter out single-end reads of an +output file out.sam though the following command. + ``` samtools view out.sam | awk '$7 == "="' >out_paired.sam ``` -The resulting file (out_paired.sam) will contain only concordant -pairs. + +The resulting file (out_paired.sam) will contain only concordant pairs. # EXIT VALUES @@ -557,24 +556,22 @@ pairs. # REPORTING BUGS -Bugs can be reported at the abismal GitHub page at the [issues -section](https://github.com/smithlabcode/abismal/issues). the abismal -GitHub page) as well as by e-mail to desenabr@usc.edu or -andrewds@usc.edu. When reporting bugs, please provide the version of -abismal used and the operating system used to run abismal. It is also -helpful, when relevant, to provide small input datasets and the genome -used so we can reproduce the issue and find the source of the problem. +Bugs can be reported at the abismal GitHub +[issues](https://github.com/smithlabcode/abismal/issues) page or by e-mail to +andrewds@usc.edu. When reporting bugs, please provide the version of abismal +used and the operating system used to run abismal. It is also helpful, when +relevant, to provide small input datasets and the genome used so we can +reproduce the issue and find the source of the problem. # COPYRIGHT -Copyright (C) 2018-2023 Andrew D. Smith and Guilherme Sena +Copyright (C) 2018-2025 Andrew D. Smith and Guilherme Sena -abismal is free software: you can redistribute it and/or modify it -under the terms of the GNU General Public License as published by the -Free Software Foundation, either version 3 of the License, or (at your -option) any later version. +abismal is free software: you can redistribute it and/or modify it under the +terms of the GNU General Public License as published by the Free Software +Foundation, either version 3 of the License, or (at your option) any later +version. -abismal is distributed in the hope that it will be useful, but WITHOUT -ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or -FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License -for more details. +abismal is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR +A PARTICULAR PURPOSE. See the GNU General Public License for more details. diff --git a/docs/abismal.1 b/docs/abismal.1 index ea5ac19..3e3f6ae 100644 --- a/docs/abismal.1 +++ b/docs/abismal.1 @@ -11,22 +11,34 @@ available at the abismal GitHub repo (https://github.com/smithlabcode/abismal). .SH SYNOPSIS .PP -abismal OPTIONS -o output.sam [] -.SS Examples +abismal map OPTIONS -o output.sam [] +.PP +abismal idx OPTIONS +.SH Examples +.IP +.nf +\f[C] +abismal map -g ref.fa -o out.sam [-s out.stats] reads_1.fq +abismal map -g ref.fa -o out.sam [-s out.stats] reads_1.fq reads_2.fq +\f[R] +.fi +.PP +If a ref.idx file was created using \f[C]abismal idx\f[R], which makes +mapping faster: .IP .nf \f[C] -abismal -g ref.fa -o out.sam [-s out.stats] reads-1.fq -abismal -g ref.fa -o out.sam [-s out.stats] reads-1.fq reads-2.fq +abismal map -i ref.idx -o out.sam [-s out.stats] reads_1.fq +abismal map -i ref.idx -o out.sam [-s out.stats] reads_1.fq reads_2.fq \f[R] .fi .PP -If a ref.idx file was created using abismalidx: +To make an index from a FASTA format reference genome named +\f[C]ref.fa\f[R] output to the file named \f[C]ref.idx\f[R]: .IP .nf \f[C] -abismal -i ref.idx -o out.sam [-s out.stats] reads-1.fq -abismal -i ref.idx -o out.sam [-s out.stats] reads-1.fq reads-2.fq +abismal idx ref.fa ref.idx \f[R] .fi .SH DESCRIPTION @@ -57,24 +69,29 @@ Run the following commands to install abismal .IP .nf \f[C] -wget https://github.com/smithlabcode/abismal/releases/download/v3.2.4/abismal-3.2.4.tar.gz -tar -xvzf abismal-3.2.4.tar.gz -cd abismal-3.2.4 +wget https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-3.3.0.tar.gz +tar -xf abismal-3.3.0.tar.gz +cd abismal-3.3.0 \&./configure --prefix=$(pwd) make make install \f[R] .fi .PP -This will create a bin directory where binaries abismalidx and abismal -are located. -The abismalidx program indexes a FASTA reference genome, and abismal -maps FASTQ reads to a reference. +These commands assume dependencies are available. +The \f[C]README.md\f[R] file in the GitHub abismal repo explains this in +more detail. +The above will create a bin directory where the abismal binary is +located. +The abismalidx program has been replaced by the \f[C]abismal idx\f[R] +command. +This indexes a FASTA reference genome, and \f[C]abismal map\f[R] maps +FASTQ reads to a reference. .SH OPTIONS .PP -i FILE, -index FILE [required if -g not provided] .PP -Input index file generated by abismalidx. +Input index file generated by \f[C]abismal idx\f[R]. Either the -g or -i parameter must be provided (but not both). If -g is provided, abismal will read the indexed genome before starting to map reads. @@ -90,8 +107,8 @@ before starting to map reads. Using -g is recommended if only one sample will be mapped to a reference genome. In most practical cases, however, it is preferable to first index the -genome using abismalidx, then using the output index file as input -through the -i flag. +genome using \f[C]abismal idx\f[R], then using the output index file as +input through the -i flag. .PP -o FILE, -outfile FILE .PP @@ -204,8 +221,7 @@ single-thread mapping. .PP -l MIN-FRAG-VALUE, -min-frag MIN-FRAG-VALUE [default : 32] .PP -\f[B]For paired-end mode only\f[R]. -The minimum size a concordant fragment can have. +(paired-end only) The minimum size a concordant fragment can have. There are cases in which concordant pairs can \[lq]dovetail\[rq], that is, the end of the reverse mate can pass the start of the forward mate. Ths parameter dictates the minimum size a dovetail read can have while @@ -222,11 +238,10 @@ The schematic below depicts what the value of -l represents: .PP -L MAX-FRAG-VALUE, -max-frag MAX-FRAG-VALUE [default : 3000] .PP -\f[B]For paired-end mode only\f[R]. -The maximum size a concordant fragment, defined as the maximum distance -between the genome start (smallest) coordinate of the forward mapped -mate and the start (largest) coordinate of the reverse mapped strand, -for a pair to be considered concordant. +(paired-end only) The maximum size a concordant fragment, defined as the +maximum distance between the genome start (smallest) coordinate of the +forward mapped mate and the start (largest) coordinate of the reverse +mapped strand, for a pair to be considered concordant. The schematic below depics how L is calculated. .IP .nf @@ -258,9 +273,8 @@ flags. .PP -P -pbat .PP -\f[B]For paired-end mapping only\f[R]. -Assumes the bisulfite conversion of the first end to be G>A and the -bisulfite conversion of the second end to be C>T. +(paired-end only) Assumes the bisulfite conversion of the first end to +be G>A and the bisulfite conversion of the second end to be C>T. .PP -R -random-pbat .PP @@ -277,8 +291,7 @@ The conversion that attains highest sum of alignment scores is kept. .PP -A -a-rich .PP -\f[B]For single-end mapping only\f[R]. -Indicates that reads are A-rich. +(single-end only) Indicates that reads are A-rich. Mapping with this flags assumes that the bisulfite conversion is G>A instead of C>T. .PP @@ -328,10 +341,9 @@ Corresponding entries in each file are assumed to be mates. .PP abismal representes mapped reads in the Sequence Alignment/Map (SAM) format. -Detailed specification of SAM is available at the samtools documentation -page (https://samtools.github.io). -BAM format is also supported, but the BAM format can be converted to SAM -using samtools if desired. +Detailed specification of SAM is available at the +samtools (https://samtools.github.io) documentation page. +BAM format is also supported. .PP The output starts with SAM headers. Headers contain metadata information on the reference genome. @@ -364,7 +376,7 @@ generate the SAM output, and is of the form .IP .nf \f[C] -\[at]PG ID:ABISMAL VN:3.2.4 CL:\[dq][command-call]\[dq] +\[at]PG ID:ABISMAL VN:3.3.0 CL:[command-call] \f[R] .fi .PP @@ -638,10 +650,9 @@ When the program, fails, an error message will be shown in STDERR describing the problem encountered. .SH REPORTING BUGS .PP -Bugs can be reported at the abismal GitHub page at the issues -section (https://github.com/smithlabcode/abismal/issues). -the abismal GitHub page) as well as by e-mail to desenabr\[at]usc.edu or -andrewds\[at]usc.edu. +Bugs can be reported at the abismal GitHub +issues (https://github.com/smithlabcode/abismal/issues) page or by +e-mail to andrewds\[at]usc.edu. When reporting bugs, please provide the version of abismal used and the operating system used to run abismal. It is also helpful, when relevant, to provide small input datasets and @@ -649,7 +660,7 @@ the genome used so we can reproduce the issue and find the source of the problem. .SH COPYRIGHT .PP -Copyright (C) 2018-2023 Andrew D. +Copyright (C) 2018-2025 Andrew D. Smith and Guilherme Sena .PP abismal is free software: you can redistribute it and/or modify it under @@ -662,4 +673,4 @@ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. .SH AUTHORS -Guilherme Sena. +Andrew D Smith and Guilherme Sena. diff --git a/docs/abismal.html b/docs/abismal.html index 13a4500..c0fbe00 100644 --- a/docs/abismal.html +++ b/docs/abismal.html @@ -1,82 +1,56 @@ - - - - - - abismal reference manual - - - - abismal reference manual - - - - - - - -
- -
-
-

abismal reference manual

-

Guilherme Sena

-
-
- -
- - -
+ + + + + + + abismal reference manual + + + +
+

abismal reference manual

+

Andrew D Smith and Guilherme Sena

+

NAME

abismal - a fast and memory-efficient mapper for short whole genome bisulfite sequencing reads. Official release, discussion, source code and development versions are available at the abismal GitHub repo.

SYNOPSIS

-

abismal OPTIONS -o output.sam <input-1.fastq> [<input-2.fastq>]

-

Examples

-
abismal -g ref.fa -o out.sam [-s out.stats] reads-1.fq
-abismal -g ref.fa -o out.sam [-s out.stats] reads-1.fq reads-2.fq
-

If a ref.idx file was created using abismalidx:

-
abismal -i ref.idx -o out.sam [-s out.stats] reads-1.fq
-abismal -i ref.idx -o out.sam [-s out.stats] reads-1.fq reads-2.fq
+

abismal map OPTIONS -o output.sam <input-1.fastq> [<input-2.fastq>]

+

abismal idx OPTIONS <reference-genome.fa>

+

Examples

+
abismal map -g ref.fa -o out.sam [-s out.stats] reads_1.fq
+abismal map -g ref.fa -o out.sam [-s out.stats] reads_1.fq reads_2.fq
+

If a ref.idx file was created using abismal idx, which makes mapping faster:

+
abismal map -i ref.idx -o out.sam [-s out.stats] reads_1.fq
+abismal map -i ref.idx -o out.sam [-s out.stats] reads_1.fq reads_2.fq
+

To make an index from a FASTA format reference genome named ref.fa output to the file named ref.idx:

+
abismal idx ref.fa ref.idx

DESCRIPTION

abismal maps short bisulfite sequencing reads in FASTQ format to a reference genome in FASTA format and produces an output file in SAM format. Only reads that are mapped uniquely or ambiguously (optional) are reported in the SAM output.

abismal can map reads assuming either C>T or G>A conversion in single-end reads. Both conversions are also accepted in paired-end mapping, but corresponding read pairs are assume to have complementary base conversions. The assumption of C>T conversion means Ts in the read are considered matches to either Cs or Ts in the reference (in either strand). The assumption of G>A conversion means that As in reads are considered matches to either Gs or As in the reference.

absimal was built to map short reads of up to 250 bases. It should successfully map reads of size up to 1 million, but because it uses very short seeds for filtration, the mapping time will increase substantially.

QUICK INSTALLATION

Run the following commands to install abismal

-
wget https://github.com/smithlabcode/abismal/releases/download/v3.2.4/abismal-3.2.4.tar.gz
-tar -xvzf abismal-3.2.4.tar.gz
-cd abismal-3.2.4
+
wget https://github.com/smithlabcode/abismal/releases/download/v3.3.0/abismal-3.3.0.tar.gz
+tar -xf abismal-3.3.0.tar.gz
+cd abismal-3.3.0
 ./configure --prefix=$(pwd)
 make
 make install
-

This will create a bin directory where binaries abismalidx and abismal are located. The abismalidx program indexes a FASTA reference genome, and abismal maps FASTQ reads to a reference.

+

These commands assume dependencies are available. The README.md file in the GitHub abismal repo explains this in more detail. The above will create a bin directory where the abismal binary is located. The abismalidx program has been replaced by the abismal idx command. This indexes a FASTA reference genome, and abismal map maps FASTQ reads to a reference.

OPTIONS

-i FILE, -index FILE [required if -g not provided]

-

Input index file generated by abismalidx. Either the -g or -i parameter must be provided (but not both). If -g is provided, abismal will read the indexed genome before starting to map reads. Using -i is recommended if several FASTQ files are mapped to the same genome.

+

Input index file generated by abismal idx. Either the -g or -i parameter must be provided (but not both). If -g is provided, abismal will read the indexed genome before starting to map reads. Using -i is recommended if several FASTQ files are mapped to the same genome.

-g FILE, -genome FILE [required if -i not provided]

-

Input FASTA genome. Either the -g or -i parameter must be provided (but not both). If -g is provided, abismal will first index the input FASTA genome before starting to map reads. Using -g is recommended if only one sample will be mapped to a reference genome. In most practical cases, however, it is preferable to first index the genome using abismalidx, then using the output index file as input through the -i flag.

+

Input FASTA genome. Either the -g or -i parameter must be provided (but not both). If -g is provided, abismal will first index the input FASTA genome before starting to map reads. Using -g is recommended if only one sample will be mapped to a reference genome. In most practical cases, however, it is preferable to first index the genome using abismal idx, then using the output index file as input through the -i flag.

-o FILE, -outfile FILE

Output file in SAM format by default. This argument is required.

-B, -bam

@@ -156,12 +130,12 @@

OPTIONS

-t NUM-THREADS, -threads NUM-THREADS [default : 1]

number of threads that should be used to map reads. Each thread reads 1000 reads in parallel, so increasing this number uses more memory by a few kilobytes per additional thread. In most practical cases this should not be significantly different than single-thread mapping.

-l MIN-FRAG-VALUE, -min-frag MIN-FRAG-VALUE [default : 32]

-

For paired-end mode only. The minimum size a concordant fragment can have. There are cases in which concordant pairs can “dovetail”, that is, the end of the reverse mate can pass the start of the forward mate. Ths parameter dictates the minimum size a dovetail read can have while accepting concordance. The schematic below depicts what the value of -l represents:

+

(paired-end only) The minimum size a concordant fragment can have. There are cases in which concordant pairs can “dovetail”, that is, the end of the reverse mate can pass the start of the forward mate. Ths parameter dictates the minimum size a dovetail read can have while accepting concordance. The schematic below depicts what the value of -l represents:

                    [==================================>] [FORWARD MATE]
 [<==================================] [REVERSE MATE]
                     |-------l-------|

-L MAX-FRAG-VALUE, -max-frag MAX-FRAG-VALUE [default : 3000]

-

For paired-end mode only. The maximum size a concordant fragment, defined as the maximum distance between the genome start (smallest) coordinate of the forward mapped mate and the start (largest) coordinate of the reverse mapped strand, for a pair to be considered concordant. The schematic below depics how L is calculated.

+

(paired-end only) The maximum size a concordant fragment, defined as the maximum distance between the genome start (smallest) coordinate of the forward mapped mate and the start (largest) coordinate of the reverse mapped strand, for a pair to be considered concordant. The schematic below depics how L is calculated.

[===============>] [FORWARD MATE]
                                          [<==============] [REVERSE MATE]
 |---------------------------L----------------------------|
@@ -170,13 +144,13 @@

OPTIONS

-a -ambig

If this flag is provided, abismal will report a random location to reads that mapped ambiguously. These reads can be identified by the presence of bit 0x100 in the SAM flags.

-P -pbat

-

For paired-end mapping only. Assumes the bisulfite conversion of the first end to be G>A and the bisulfite conversion of the second end to be C>T.

+

(paired-end only) Assumes the bisulfite conversion of the first end to be G>A and the bisulfite conversion of the second end to be C>T.

-R -random-pbat

Maps reads in random PBAT mode.

For single-end mapping, abismal will attempt to map reads assuming C>T conversion, then G>A, and keeping the conversion that attains the best alignment score.

For paired-end mapping, abismal will attempt to map reads assuming end 1 has C>T conversion and end 2 has G>A conversion. Then it will map the same reads assuming end 1 has G>A conversion. The conversion that attains highest sum of alignment scores is kept.

-A -a-rich

-

For single-end mapping only. Indicates that reads are A-rich. Mapping with this flags assumes that the bisulfite conversion is G>A instead of C>T.

+

(single-end only) Indicates that reads are A-rich. Mapping with this flags assumes that the bisulfite conversion is G>A instead of C>T.

-v -verbose

Prints more run info on the mapping progress, including a progress bar showing the percentage of input reads currently processed.

INPUT FASTQ FORMAT

@@ -196,7 +170,7 @@

INPUT FASTQ FORMAT

For paired-end data, it is mandatory that both FASTQ ends have the same number of lines. Corresponding entries in each file are assumed to be mates.

OUTPUT SAM FORMAT

Output headers

-

abismal representes mapped reads in the Sequence Alignment/Map (SAM) format. Detailed specification of SAM is available at the samtools documentation page. BAM format is also supported, but the BAM format can be converted to SAM using samtools if desired.

+

abismal representes mapped reads in the Sequence Alignment/Map (SAM) format. Detailed specification of SAM is available at the samtools documentation page. BAM format is also supported.

The output starts with SAM headers. Headers contain metadata information on the reference genome. Each header line starts with the @ character.

The first line in the SAM output file is given by

@HD VN:1.0
@@ -204,7 +178,7 @@

Output headers

@SQ SN:[chrom-name] LN:[chrom-length]

where [chrom-name] is given by the first word of the chromosome name in the FASTA file (anything after the first white space is deleted), and [chrom-length] is the number of bases in the chromosome sequence.

The last line of the headers is a copy of how the program was called to generate the SAM output, and is of the form

-
@PG ID:ABISMAL  VN:3.2.4  CL:"[command-call]"
+
@PG ID:ABISMAL  VN:3.3.0  CL:[command-call]

where [command-call] is the shell command used to run abismal.

Output mapped lines

Following the header lines, reads that are mapped once (or at least once if the -a flag is used) are reported. Each read is a set of thirteen tab-separated fields as follows.

@@ -280,13 +254,10 @@

EXIT VALUES

0 : Success.

1 : Runtime error. When the program, fails, an error message will be shown in STDERR describing the problem encountered.

REPORTING BUGS

-

Bugs can be reported at the abismal GitHub page at the issues section. the abismal GitHub page) as well as by e-mail to desenabr@usc.edu or andrewds@usc.edu. When reporting bugs, please provide the version of abismal used and the operating system used to run abismal. It is also helpful, when relevant, to provide small input datasets and the genome used so we can reproduce the issue and find the source of the problem.

+

Bugs can be reported at the abismal GitHub issues page or by e-mail to andrewds@usc.edu. When reporting bugs, please provide the version of abismal used and the operating system used to run abismal. It is also helpful, when relevant, to provide small input datasets and the genome used so we can reproduce the issue and find the source of the problem.

COPYRIGHT

-

Copyright (C) 2018-2023 Andrew D. Smith and Guilherme Sena

+

Copyright (C) 2018-2025 Andrew D. Smith and Guilherme Sena

abismal is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

abismal is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

-
-
-
- + diff --git a/src/abismal.cpp b/src/abismal.cpp index 79dd927..82eb613 100644 --- a/src/abismal.cpp +++ b/src/abismal.cpp @@ -1766,6 +1766,7 @@ run_single_ended(const bool show_progress, const bool allow_ambig, thread.join(); if (show_progress) { + std::cerr << "\n"; const auto stop_time{abismal_clock::now()}; log_msg("reads mapped: " + std::to_string(rl.get_current_read())); log_msg("total mapping time: " + format_duration(start_time, stop_time)); @@ -2325,6 +2326,7 @@ run_paired_ended(const bool show_progress, const bool allow_ambig, thread.join(); if (show_progress) { + std::cerr << "\n"; const auto stop_time{abismal_clock::now()}; log_msg("reads mapped: " + std::to_string(rl1.get_current_read())); log_msg("total mapping time: " + format_duration(start_time, stop_time));