Many users of nanopolish have noticed that it takes quite a lot of compute time to polish a genome, particularly for the very deep sequencing runs that are now common with R9.4 flowcells. I promised on Twitter that I would work on that, and the first set of optimizations has been pushed to the nanopolish git repository. In this short post I’ll explain where one of the bottlenecks was and how we fixed it.
Nanopolish calculates an improved consensus sequence for a draft genome assembly by evaluating candidate edits using a signal-level hidden Markov model. The first stage of the algorithm examines the read alignments to discover where the genome assembly may contain errors. There are two phases to candidate generation. Large edits, like multiple inserted or deleted bases, are harvested from the aligned basecalled reads. We found that single-base edits are often missed in this phase, so a second pass exhaustively tests all possible single-base substitutions, insertions and deletions using the HMM. For a genome of length L with average depth d, this will make about 9Ld calls to the hidden Markov model, which dominates nanopolish’s runtime. Often nanopolish only ends up changing 0.2-1.0% of the assembly, so most of the candidate edits are rejected.
The new version of nanopolish has a smarter candidate testing algorithm that stops evaluating a candidate change when the log-likelihood ratio computed by the HMM reaches a threshold, t. This reduces the effect of depth (d) on runtime, as most candidate edits have very poor log-likelihood ratios and are quickly rejected.
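To get a feel for the scale of the 9Ld figure, here is a back-of-the-envelope calculation in Python. The genome length (about 4.6 Mbp, roughly the size of an E. coli genome) and the depth (300x) are example values I picked for illustration, and the function name is made up for this post; it is not part of nanopolish.

```python
def exhaustive_hmm_calls(genome_length: int, depth: float, edits_per_base: int = 9) -> float:
    """Approximate number of HMM evaluations for exhaustive testing: 9 * L * d."""
    return edits_per_base * genome_length * depth


# Assumed example values: a ~4.6 Mbp genome sequenced to 300x depth.
calls = exhaustive_hmm_calls(genome_length=4_600_000, depth=300)
print(f"{calls:.2e} HMM calls")  # 1.24e+10 HMM calls
```

The count grows linearly with depth, which is why very deep runs were hit hardest by exhaustive testing.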
The new scoring algorithm uses a threshold of t = 100 by default. For the less patient users there is a more aggressive option (--faster, which sets t = 25), but I want to test this further before making it the default.
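The early-stopping idea can be sketched roughly as follows. This is a simplified illustration, not nanopolish's actual code: I am assuming the log-likelihood ratio is accumulated one read at a time, that negative values favour the current assembly, and that a candidate is rejected once the running total falls to -t. The names `score_candidate` and `llr_for_read` are hypothetical.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Read = TypeVar("Read")


def score_candidate(
    reads: Iterable[Read],
    llr_for_read: Callable[[Read], float],
    threshold: float = 100.0,
) -> Tuple[float, int, bool]:
    """Accumulate per-read log-likelihood ratios, stopping early on clear rejects.

    Returns (total LLR, number of reads evaluated, whether the edit is kept).
    Assumption: negative LLR means the read prefers the unedited assembly.
    """
    total = 0.0
    n_evaluated = 0
    for read in reads:
        total += llr_for_read(read)  # stand-in for one HMM evaluation
        n_evaluated += 1
        if total <= -threshold:
            # Hopeless candidate: skip the remaining reads entirely.
            return total, n_evaluated, False
    return total, n_evaluated, total > 0


# Toy data: 300 reads that each disfavour the edit by 10 log-likelihood units.
toy_llrs = [-10.0] * 300
print(score_candidate(toy_llrs, lambda llr: llr, threshold=100.0))  # (-100.0, 10, False)
print(score_candidate(toy_llrs, lambda llr: llr, threshold=25.0))   # (-30.0, 3, False)
```

In this toy case a bad candidate costs 10 HMM evaluations at t = 100 and 3 at t = 25, instead of 300, regardless of how deep the coverage is. The trade-off with a smaller threshold is a higher chance of abandoning a candidate that later reads would have supported.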
The results of this new method are fairly dramatic on my benchmarking data set (E. coli R9.4 data at 300x depth):
|Runtime (CPU hours)
With default parameters, 0.7.0 is 7.5x faster than the previous version on this assembly. With the --faster flag, it is over 10x faster. We have other optimizations planned, so I hope we can reduce runtime even further. More soon, hopefully.
Update July 05: We’ve made two more improvements in nanopolish 0.7.1 to drop runtime on E. coli down to 91 CPU hours. The first change reduces the length of the event sequence and reference region that are input into the HMM during candidate variant screening. The second change stops trying to improve the consensus when the variant set converges rather than iterating for a fixed number of rounds.
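The second change, stopping on convergence, follows a familiar pattern that can be sketched like this. It is only an illustration under my own assumptions: `refine` stands in for one round of consensus improvement, variants are treated as hashable items in a set, and `max_rounds` is a safety cap I added for the example. None of these names come from nanopolish.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, Tuple


def polish_until_converged(
    initial_variants: Iterable[Hashable],
    refine: Callable[[FrozenSet[Hashable]], Iterable[Hashable]],
    max_rounds: int = 10,
) -> Tuple[FrozenSet[Hashable], int]:
    """Repeat refinement until the variant set stops changing.

    Returns the final variant set and the number of rounds that were run.
    """
    current = frozenset(initial_variants)
    for round_number in range(1, max_rounds + 1):
        updated = frozenset(refine(current))
        if updated == current:
            # Nothing changed this round, so further rounds would be wasted work.
            return current, round_number
        current = updated
    return current, max_rounds


# Toy refinement step: drop the largest "variant" until only two remain.
def toy_refine(variants):
    return variants if len(variants) <= 2 else variants - {max(variants)}


final_set, rounds = polish_until_converged({1, 2, 3, 4}, toy_refine)
print(sorted(final_set), rounds)  # [1, 2] 3
```

With a fixed number of rounds, the toy example would keep calling `toy_refine` long after the set had settled; checking for equality between rounds ends the loop as soon as a round makes no change.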