BMTagger
Introduction
Best Match Tagger (BMTagger) is an efficient tool that discriminates between human reads and microbial reads without doing an alignment of all reads to the human genome.
Running Instructions
Instructions (README) at the software download site are partially repeated here, but not all of them apply directly on WestGrid systems. For example, blastn is already available on many WestGrid systems, so it does not have to be downloaded.
Steps done once per reference genome are:

1. Make index for bmfilter using command

       bmtool -d -o -w 18

   where reference.fa is fasta file for the screening database. For HMP, this can be the human genome. Output is a binary file generated in reference.bitmask. To make a compressed index, add flag "-z" to the above command line.

2. Make index for srprism

       srprism mkindex -i -o -M 7168

   This generates files with prefix reference.srprism.

3. Make blastdb for blast

       makeblastdb -in -dbtype nucl

   This generates database files for blastn. makeblastdb and blastn can be downloaded from blast distribution: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/

Commands for running the tagger depend on the data source for reads.

- For single reads in fasta format, the command is:

      bmtagger.sh -b reference.bitmask -x reference.srprism -T tmp -q 0 -1 -o

- For paired reads in fasta format, the command is:

      bmtagger.sh -b reference.bitmask -x reference.srprism -T tmp -q 0 -1 -2 -o

- For single reads in fastq format, the command is:

      bmtagger.sh -b reference.bitmask -x reference.srprism -T tmp -q 1 -1 -o

- For paired reads in fastq format, the command is:

      bmtagger.sh -b reference.bitmask -x reference.srprism -T tmp -q 1 -1 -2 -o

- For reads read directly from SRA, the command is:

      bmtagger.sh -b reference.bitmask -x reference.srprism -T tmp -A -o

Note that the application assumes that if a run is specified, then the SRA structure has been installed as per the directions in SRA toolkit.
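The three once-per-reference steps can be collected into a small helper. The following is a minimal sketch, not the official procedure: it assumes that -d, -i and -in take the reference fasta file and that -o takes the output name, which is inferred from the file names mentioned in the text (reference.fa, reference.bitmask, reference.srprism). The function name index_commands is hypothetical, and file names are assumed to contain no spaces. It only prints the commands so that they can be checked against the README before anything is run.

```shell
#!/bin/sh
# Print the three one-time indexing commands for a reference fasta file.
# Assumption: -d / -i / -in take the fasta file and -o takes the output
# name; confirm the argument order against the README files before use.
index_commands() {
    ref=$1              # reference fasta file, e.g. reference.fa
    base=${ref%.*}      # drop the last extension, e.g. reference
    printf 'bmtool -d %s -o %s.bitmask -w 18\n' "$ref" "$base"
    printf 'srprism mkindex -i %s -o %s.srprism -M 7168\n' "$ref" "$base"
    printf 'makeblastdb -in %s -dbtype nucl\n' "$ref"
}

# Show the commands for a reference named reference.fa.
index_commands reference.fa
```

Once the printed commands look right, they can be executed by piping the output to sh, for example `index_commands reference.fa | sh`.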
In all of the above scenarios, -b, -x, and -T specify the index for bmfilter, the index for srprism, and the directory to use for temporary files, respectively.
If no temporary directory is specified, the current working directory is used.
Values of 0 and 1 for flag -q specify fasta and fastq input files, respectively.
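The choice of -q and the optional second read file can be wrapped in a helper that assembles the command line. This is a sketch under stated assumptions: -1 and -2 are taken to accept the read files and -o the output file, the function name tagger_command and all read and output file names are hypothetical, and guessing the format from the file extension is a convenience of this example, not something bmtagger.sh does. The command is printed rather than executed.

```shell
#!/bin/sh
# Build a bmtagger.sh command line for one or two read files.
# Usage: tagger_command OUTPUT READS_1 [READS_2]
# Assumption: -1/-2 take the read files and -o takes the output file.
tagger_command() {
    out=$1
    r1=$2
    r2=${3:-}                       # empty for single reads
    case $r1 in
        *.fastq|*.fq) q=1 ;;        # fastq input
        *.fasta|*.fa) q=0 ;;        # fasta input
        *) echo "cannot tell fasta from fastq: $r1" >&2; return 1 ;;
    esac
    cmd="bmtagger.sh -b reference.bitmask -x reference.srprism -T tmp -q $q -1 $r1"
    if [ -n "$r2" ]; then
        cmd="$cmd -2 $r2"           # paired reads
    fi
    printf '%s -o %s\n' "$cmd" "$out"
}

# Paired fastq reads, with hypothetical file names.
tagger_command human_ids.txt reads_1.fastq reads_2.fastq
```

Printing the command first makes it easy to confirm that the -q value matches the input format before a long run is started.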
The output specified by -o is a file name if the input is fasta or fastq, and a directory if the input is a run. For example, the output for run SRR059480 when -o is my_results will be a file my_results/SRR059480.blacklist that contains the SRA indexes of reads found to be human rather than the full ids. Output files for fasta or fastq inputs contain the ids of reads found to be human.

If the input files are fasta or fastq, adding flag -X produces the reads that are NOT tagged as human, in the original input format, instead of the list of reads that are tagged as human.

See README.bmtagger.txt, README.bmfilter.txt and README.bmtool.txt for more details.

For systems with 3GB-9GB of free RAM, the compressed bitmask format is preferred.
If more than 9GB of RAM is available, an uncompressed bitmask will work faster.
However, producing either a compressed or an uncompressed bitmask requires 9GB of available memory.
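The memory guidance above amounts to a simple threshold rule, sketched below. The function name bitmask_format is hypothetical, and the amount of free RAM is passed in as a whole number of GB because how to measure it differs between systems. The rule only covers using the bitmask; building either format still needs the 9GB noted above.

```shell
#!/bin/sh
# Suggest a bitmask format for a given amount of free RAM (whole GB).
# Thresholds follow the guidance in the text: 3GB-9GB -> compressed,
# more than 9GB -> uncompressed.
bitmask_format() {
    free_gb=$1
    if [ "$free_gb" -gt 9 ]; then
        echo uncompressed           # works faster when RAM allows
    elif [ "$free_gb" -ge 3 ]; then
        echo compressed             # made by adding -z to the bmtool command
    else
        echo "less than 3GB free: outside the range covered here" >&2
        return 1
    fi
}

bitmask_format 8     # prints: compressed
bitmask_format 16    # prints: uncompressed
```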