Skip to content

Bad alloc... maybe bad alignment? #42

Description

@otaviolovison

Dear community,

I am having memory issues while running the "Bioconductor workflow for microbiome data analysis: from raw reads to community analyses" and need some orientation. I am using this workflow for more then 2 years and this is the first time I am having issues with this amazing workflow. Let me put it in context.

I am working with upper respiratory tract samples. Libraries were performed following Illumina's protocol (16S v3v4), but sequenced with MiSeq V2 kit (we know that MiSeq V3 kit is recommended and will be using it from now on, but we have also performed successful analyses with V2 kit before). Since we had problems with MiSeq, we have a batch effect on these samples, but I believe this is not the problem. I believe we had a good sequencing quality also (attached - samples with low quality were removed - M49, M66...).
RevQualityPlot.pdf
FwdQualityPlot.pdf

The filter and trim was perfomed as follows:

for(i in seq_along(fnFs)) {
fastqPairedFilter(c(fnFs[[i]], fnRs[[i]]),
c(fnFs[[i]], fnRs[[i]]),
trimLeft=c(27,31), truncLen=c(240, 240),
maxN=0, maxEE=2, truncQ=2,
compress=TRUE)
}

I performed a pooled inference of sequences which resulted in around 11 million sequences and more then 1 million unique sequences.

dadaFs <- dada(derepFs, err=ddF[[1]]$err_out, pool=TRUE)
dadaRs <- dada(derepRs, err=ddR[[1]]$err_out, pool=TRUE)

After merging, chimera removal and trimming of non-target length sequences, these resulted in which I think is a plausible number of ASVs.

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab.all <- makeSequenceTable(mergers[!grepl("Mock", names(mergers))])
seqtab2 <- removeBimeraDenovo(seqtab.all)
table(nchar(getSequences(seqtab2)))

If I am not wrong, my expected amplicon size is ~402. The majority of my sequences length were between 398-409 (402 and 407 had most of them), so I removed the sequences outside this range.

seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% seq(398,408)]

Reads through the pipeline:

Reads through the pipeline
input filtered denoisedF denoisedR merged nonchim
M03 213895 177214 176972 176990 174471 146119
M04 214250 177991 176814 177140 167873 63085
M05 111659 76484 76335 76409 75549 72923
M06 205858 144111 142696 143278 133717 66378
M07 221406 183339 182223 182781 175147 83000
M08 202433 171376 170365 170763 162915 100385
M10 260685 212201 211421 211882 206629 157737
M11 168531 120209 119097 119297 111884 51200
M13 225351 186372 185161 185649 177682 95916
M14 238618 193832 192852 193165 186893 121542
M15 255374 176133 175008 175574 167765 69808
M16 210136 139221 138490 138794 132123 50854
M17 173373 123043 122929 122965 121961 110472
M18 234219 161497 159773 159856 149661 90075
M20 205971 137406 136095 136592 129638 66840
M21 169008 105744 105091 104999 100811 72806
M22 173211 128590 127405 127632 119163 50332
M23 222669 176956 175712 175855 168136 121634
M24 253965 189089 187971 188081 179780 109824
M25 271688 197873 196825 196877 190451 110410
M26 214020 176747 175353 175750 165288 66787
M27 169323 121260 120503 120702 115835 77525
M28 98143 73598 73447 73461 72256 45415
M29 168769 114365 113325 113765 106913 57167
M30 79706 47161 46705 46747 43990 32545
M31 216586 174373 173449 173069 166518 115607
M32 245302 198261 195913 196865 180968 71785
M33 252679 174899 174184 174269 169890 114857
M34 165374 115868 115657 115624 113552 90644
M35 295397 241362 239965 240552 231041 117235
M36 179980 123456 122385 122763 116195 64997
M37 137855 88581 87982 88225 84790 55937
M38 221521 173301 171426 171445 157802 71829
M39 237677 165944 164886 164375 154693 68683
M40 138247 107340 106856 106980 103382 59637
M41 74104 57649 56946 57051 53156 22797
M42 250248 216338 215940 215957 212792 126272
M43 201777 165061 162960 163791 150705 76924
M44 194097 159804 158934 159243 153752 76524
M45 154146 115990 114709 115034 107032 67916
M46 98062 74028 73436 73698 69333 34707
M47 151414 114086 113620 113802 110812 68390
M50 160701 124570 124090 124157 121332 93739
M51 172661 125215 124349 124694 118074 78879
M52 171285 110213 109614 109786 104073 68427
M53 78156 57138 56807 56909 54413 30587
M54 207224 157448 157096 157168 155455 124792
M55 117487 89900 89137 89387 84053 39041
M56 150359 115705 115185 115323 112669 68167
M57 185730 147264 146206 146635 140884 83492
M58 168137 130837 130102 130269 124604 61054
M59 128055 103151 102778 102957 101100 60711
M60 115132 85510 84374 84719 76580 41714
M61 268359 219507 218663 218897 211526 120186
M62 152527 112127 111656 111649 107430 66822
M63 189507 143665 140768 141921 126686 74779
M64 164891 126167 125433 125563 120433 71083
M65 214833 135956 135166 135227 129453 93481
M66 67953 19342 19144 19203 17511 12326
M67 24965 19296 19093 19165 18226 14620
M68 368955 283223 281144 281786 268906 175704
M69 221485 167142 165543 165323 149716 70501
M70 237922 166857 165619 165967 155037 67710
M72 247257 198857 197425 198209 192654 140753
M73 241340 195173 194223 194673 188389 123507
M74 259993 199604 198181 198941 191798 124866
M75 447637 365839 361253 362044 336607 139179
M76 92208 59597 59080 59212 55080 30423
M77 261440 205708 203914 204524 191137 98587
M78 269973 222774 222089 222217 217919 164074
M80 232728 185294 182677 183727 169525 92942
M81 221378 177994 175379 176259 161605 88817
M82 87115 37924 37767 37810 36643 23878
M83 258835 193636 193044 193131 189052 139140
M84 100648 71908 71607 71642 69494 30952
M85 402056 309207 307966 308214 298330 186767
M86 72130 25457 25401 25424 25135 20839
M87 228186 180601 179640 179982 173012 90041
M88 135730 58092 57940 57960 56081 31948

Then I performed Assign Taxonomy with eHOMD database formatted for dada2 available here. https://github.com/fconstancias/metabaRpipe/tree/master/databases

Then the problem started. I tried to perform sequence alignment as follows:
seqs <- getSequences(seqtab)
names(seqs) <- seqs # This propagates to the tip labels of the tree
mult <- msa(seqs, method="ClustalW", type="dna", order="input")

This resulted in bad alloc error.

Then I changed to DECIPHER alignment, as follows:

seqs <- getSequences(seqtab)
names(seqs) <- seqs # This propagates to the tip labels of the tree
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)

This ran perfectly and completed in less then 24 hours. I don't know how to check for alignment quality, but I believe that the problem is here (suggestions on how to check alignment quality?).

After this, I performed the following:

phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)

This also completed and the resulting 'dm' object was extremelly large (> 18gb). Until now, the objects had reasonable sizes. Comparing with previous analysis, executed with MSA, this is the first time I had this size of 'dm' object. But it is also possible that this is a feature of this specific data.. I don't know..

Then, when I tried to execute:

treeNJ <- NJ(dm)

I got the bad alloc error again..

Then I decided to revisit the alignment... trimmed the data a little more (398-408), which made MSA alignment executable wth our memory capacity. But the MSA alignment is running for 5 days by now, and I don't know how much time it will take to finish... I have a close deadline and can't wait for this anymore.

I tried to use the DECIPHER alignment to build the tree on RAxML and got the following error:

The RAxML analysis 73345 has terminated : OUT_OF_MEMORY

You can access the result at: https://raxml-ng.vital-it.ch/#/result/73345/code/cdo1GZi6VrB4.

The cause of the analysis failure might be described in the log below:

slurmstepd: error: couldn't chdir to /var/vhosts/[vital-it.ch/raxml-ng/htdocs/api](http://vital-it.ch/raxml-ng/htdocs/api)': No such file or directory: going to /tmp instead slurmstepd: error: couldn't chdir to /var/vhosts/vital-it.ch/raxml-ng/htdocs/api': No such file or directory: going to /tmp instead
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=57947.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cpt04: task 0: Out Of Memory
scp: /scratch/local/weekly/raxml/job_73345/sequenceAlignment.fa: Permission denied

Thank you for using the RAxML analysis service provided by SIB.

I think the problem is with DECIPHER alignment or with, maybe, with the FASTA I exported from RStudio. I used the following code to generate fasta:

writeXStringSet(alignment, 'alignment.fa')

My questions are:

Do you detect any preprocessing failures that could be compromising this analysis?
The 'dm' object is really to large or this is a common size for this object? If this is a common size for this object, any suggestions on how to build a phylogenetic tree with this (64gb RAM memory)?
If this is not a common size for 'dm' object, any suggestion on how to solve the DECIPHER alignment problem (if this is really an alignment problem)?

Feel free to ask for any additional information you need.
Hope you can help me with this because I'm getting a little desperate here...

Best wishes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions