last-map-probs

This script reads alignments of DNA reads to a genome, and estimates the probability that each alignment represents the genomic source of the read.

It writes the alignments with "mismap" probabilities, i.e. the probability that the alignment does not represent the genomic source of the read. By default, it discards alignments with mismap probability > 0.01.

Typical usage

These commands map DNA reads to the human genome:

lastdb -m1111110 hu human/chr*.fa
lastal -Q1 -e120 hu reads.fastq | last-map-probs.py -s150 > myalns.maf

Explanation of typical usage

These commands find alignments with mismap probability <= 0.01 and score >= 150 (-s150). The score threshold should be high enough to avoid random, spurious alignments: otherwise, the mismap probabilities will not be reliable. A threshold of 150 is often reasonable. For instance, if we compare 50 bp reads to the human genome, we expect a random alignment with score >= 150 once every few thousand reads.

The lastal command finds alignments with score >= 120 (-e120). This is because the mismap probability of an alignment with score 150 may depend on alignments with score < 150. Setting e = s-30 seems to work well.

(We used to recommend option -d108, but this is no longer necessary.)

Options

-h, --help Print a help message and exit.
-m M, --mismap=M
 Don't write alignments with mismap probability > M.
-s S, --score=S
 Don't write alignments with score < S.

Details

Using multiple CPUs

This will run the pipeline on all your CPU cores:

parallel-fastq "lastal -Q1 -e120 hu | last-map-probs.py -s150" < reads.fastq > myalns.maf

It requires GNU parallel to be installed (http://www.gnu.org/software/parallel/).

Limitations

Method

Suppose one query sequence has three alignments, with scores: s1, s2, s3. The probability that the first alignment is the one that reflects the origin of the query, is:

exp(s1/t) / [exp(s1/t) + exp(s2/t) + exp(s3/t)]

Here, t is a parameter that depends on the scoring scheme: it is written in the lastal header.

Reference

For more information, please see this article:

Incorporating sequence quality data into alignment improves DNA read mapping. Frith MC, Wan R, Horton P. Nucleic Acids Research 2010 38:e100.