Supporting R9 data in nanopolish
Two weeks ago I mentioned on twitter that I pushed a large series of changes to support R9 Oxford Nanopore data in nanopolish. I promised a blog post describing the changes then went off on my summer vacation - here it is, a little late.
First, some background about R9 data. Back in March Clive Brown held a google hangout to announce that ONT was switching to a new pore. The old pore, dubbed R7, would be discontinued and a new pore based on the CsgG lipoprotein, called R9, would take its place. A host of other changes came with the pore swap; the sequencing speed would be increased (from 70bp/s to 250bp/s) and the ONT basecaller would now use a recurrent neural network rather than a hidden Markov model. Supporting this data would require quite a few changes to nanopolish - we would need a new pore model (a set of parameters for the Gaussian emission distributions in our HMM), new transition probabilities for the model (due to the increased speed) and we could no longer rely on extracing the parameters that scale a read to the pore model from the FAST5 files as the RNN does not calculate these.
The R9 pore model was the simplest issue to resolve - ONT provided a new table mapping k-mers to Gaussian parameters. We then wrote code to infer the per-read scaling parameters from the output of the RNN, which assigns 5-mer labels to events. Nick and Josh generated some early R9 data and I worked on updating the new transition parameters in the HMM. The initial polishing experiments went ok - we got up to about 99.7% accuracy - but fell short of our best R7 accuracy, which was about 99.8%. After I returned from London Calling I spent a painful few weeks looking at regions of the E. coli genome that were miscalled. I eventually realized that the change in sequencing speed had a bigger effect than I realized - rather than just changing the transition parameters of the HMM I would have to change the structure as well. The key issue is that R9 events are oversegmented (there are more events than base movements through the pore) rather than undersegmented, like in R7. The more aggressive segmentation makes the events more susceptible to transient noise causing their level to be drastically different than the model. To fix this issue we added a new state to the HMM which would allow it to completely ignore an event. The transitions to this state have very low probability to avoid allowing the HMM to ignore many events.
In the midst of all of these model changes I rewrote the consensus algorithm. The original consensus algorithm had a rather clunky system for keeping track of where the reads aligned to the assembly. This system made development and debugging tedious so I merged the consensus functionality with the much better code that we wrote to call SNPs for Ebola. The README on github contains instructions for using the new consensus workflow.
R9 Polishing Results
ONT provided my lab with a batch of the latest flowcells. Phil Zuzarte broke our lab record for throughput by generating almost 1Gbp of 1D reads and 450 Mbp of 2D reads for E. coli K12. We used the latest version of canu to generate a draft assembly of this data and computed a new consensus using the updated nanopolish. As before we evaluated our assembly using nucmer which reported 99.96% identity to the K12 reference. This is a remarkable improvement from our first polishing run from January 2015, which only reached 99.48% identity. As an indication of just how much the data has improved in the last year, the unpolished canu assembly from this run has better accuracy (99.66%) than that first polished assembly.
Clive suggested we try to mix R7 and R9 data to jointly call a consensus, which bumped identity up to 99.98%. This hints that mixing data with different signal properties - might be a direction to explore.
Homopolymers
These results are encouraging but we want to improve accuracy further. Most of the remaining errors are (predictably) in homopolymer runs. Interestingly we are able to call some longer homopolymers (7,8,9 bp runs) correctly now using the new HMM (which does not penalize homopolymers so if there are multiple events for a homopolymer they will not be collapsed to a single state). However accuracy for long homopolymers remains the biggest problem. To address this we have started to explore explicitly modeling the duration of events. This should be a major focus in the upcoming months.
So far we have mostly tested the new algorithm on E. coli data - if anyone else has a high-depth R9 data set and would like to give the new code a try we would love to hear from you.
Thanks to Matei David, Jonathan Dursi, Nick Loman, Josh Quick and Phil Zuzarte for helping develop the R9 support and generating data. The scientists at Oxford Nanopore helped us better understand R9 data and the discussions we had with them were key to implementing the improved HMM - thanks in particular to Clive and Chris Wright.
Disclosure: Oxford Nanopore provides research funding to my lab.