Lowering the cost of nanopolish
Oxford Nanopore’s sequencers measure the disruption in electric current caused by single-stranded DNA moving through the nanopore. The device samples the current six thousand times per second and writes the samples to a FAST5 file. We refer to these measurements as “the raw signal” or “the raw samples”, or simply “the raw”. For the past three years nanopore basecallers have converted the raw samples into segments called “events”, with the boundaries between events roughly corresponding to movements of DNA through the pore (for a discussion of the key concepts behind nanopore data analysis see my slides from this year’s AGBT). After the samples are segmented into events, the basecaller predicts which k-mer was in the pore when the samples for each event were taken. The basecalling results are stored in a new FAST5 file that has a table containing every event and its k-mer label. Nanopolish relies on this “event table” to determine which events correspond to each base in the read. When computing whether a base is methylated or calling a SNP, nanopolish uses this base-to-event map to quickly extract the local region of the signal for input into the nanopolish HMM.
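To make the base-to-event map concrete, here is a minimal Python sketch. It is an illustration only, not nanopolish's actual data structures: the `Event` fields, the `move` convention (how many bases the strand advanced since the previous event) and both function names are assumptions chosen for this example, and the numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: int    # index of the first raw sample in this event
    length: int   # number of raw samples in this event
    mean: float   # mean current over the event (illustrative values)
    kmer: str     # k-mer label assigned by the basecaller
    move: int     # bases advanced since the previous event (0 = same k-mer)

def build_base_to_event_map(events):
    """For each k-mer position in the read, list the indices of its events."""
    base_to_events = []
    for idx, ev in enumerate(events):
        if idx == 0 or ev.move > 0:
            # A move larger than 1 skips positions that have no events.
            for _ in range(max(ev.move, 1) - 1):
                base_to_events.append([])
            base_to_events.append([idx])
        else:
            # move == 0: another event for the same k-mer position.
            base_to_events[-1].append(idx)
    return base_to_events

def signal_for_positions(raw, events, base_to_events, first, last):
    """Return the raw samples covering k-mer positions first..last (inclusive)."""
    idxs = [i for pos in range(first, last + 1) for i in base_to_events[pos]]
    if not idxs:
        return []
    lo = events[idxs[0]].start
    hi = events[idxs[-1]].start + events[idxs[-1]].length
    return raw[lo:hi]

# Toy read with 3-mers: AAC -> ACG -> (CGT skipped) -> GTT
events = [
    Event(0, 3, 80.0, "AAC", 0),
    Event(3, 2, 80.5, "AAC", 0),
    Event(5, 4, 95.0, "ACG", 1),
    Event(9, 3, 101.0, "GTT", 2),
]
raw = list(range(12))  # stand-in for the raw current samples
base_to_events = build_base_to_event_map(events)
window = signal_for_positions(raw, events, base_to_events, 1, 2)
```

With a map like this, pulling out the signal around a candidate SNP or methylation site is a list lookup rather than a scan over the whole read.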
Parsing the event table from the FAST5 file allows efficient reanalysis of the signal data, but there are a few problems with this approach. First, the event table in the basecalled FAST5 files takes up a huge amount of disk space, as Matei described in this post. To run nanopolish you must keep the basecalled FAST5s around, at the cost of about 10x more disk space than the raw FAST5s. The second issue is that basecalling is moving away from events and towards calling directly from the raw samples. This is no surprise, as Clive Brown has often talked about the limitations of event-based analysis. Last week Oxford Nanopore released Albacore 2.0, which basecalls directly from the raw signal. With raw basecalling, the event table nanopolish relies on is no longer in the basecalled FAST5 files. The third problem is that parsing the event table was always a bit fragile. It required walking in lockstep along the sequence of k-mers in the basecalled read and the sequence of k-mers in the event table. If there was any difference between the two sequences, the parser would break and the read would become unusable. This caused a longstanding, difficult-to-fix issue when an early version of Albacore introduced a subtle difference in how the event table k-mer sequence was encoded. It also made nanopolish incompatible with useful preprocessing tools like Porechop, as any modification to the basecalled read would break the correspondence with the event table and cause parsing to fail.
It was clear that continuing to rely on the event table was not viable, so we’ve implemented an entirely new way of loading the signal data into nanopolish. Starting with version 0.8, nanopolish can directly read the raw FAST5 files produced by the sequencer. The key idea behind the new data loader is that the event table is unnecessary: we can use dynamic programming to calculate an alignment between events and the basecalled read that approximates what was contained in the event table. Nanopolish does this on the fly when reading a raw FAST5 file by segmenting the raw samples into events using code from scrappie (provided by Tim Massingham at ONT), then aligning the detected events to the basecalled sequence. This is slightly slower (100 CPU hours to polish E. coli vs 85 when reading the event table), but the increased compute cost is more than offset by the huge savings in disk space.
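The alignment step can be illustrated with a deliberately simplified dynamic program in Python. This is a sketch under stated assumptions, not nanopolish's implementation. It assumes the raw samples have already been segmented into events, and that each k-mer of the basecalled read has an expected current level (here invented numbers standing in for a pore model). It scores an event against a k-mer by absolute difference, and it lets each event either stay on the current k-mer or advance by exactly one, so skipped k-mers are not modelled. The function names are hypothetical.

```python
import math

def align_events_to_kmers(event_means, kmer_levels):
    """Assign each event to a k-mer index with a monotone DP.

    Each event either stays on the previous event's k-mer or advances by one.
    The first event is pinned to the first k-mer and the last to the last.
    """
    n, m = len(event_means), len(kmer_levels)
    if m == 0 or n < m:
        raise ValueError("need at least one event per k-mer")
    cost = [[math.inf] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    cost[0][0] = abs(event_means[0] - kmer_levels[0])
    for i in range(1, n):
        for j in range(m):
            d = abs(event_means[i] - kmer_levels[j])
            stay = cost[i - 1][j]
            step = cost[i - 1][j - 1] if j > 0 else math.inf
            if step <= stay:
                cost[i][j], back[i][j] = step + d, j - 1
            else:
                cost[i][j], back[i][j] = stay + d, j
    # Trace back from the cell (last event, last k-mer).
    path = [m - 1]
    j = m - 1
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path

def kmer_to_event_ranges(path, n_kmers):
    """Invert the path: (first_event, last_event) for each k-mer index."""
    ranges = [None] * n_kmers
    for event_idx, kmer_idx in enumerate(path):
        first, _ = ranges[kmer_idx] or (event_idx, event_idx)
        ranges[kmer_idx] = (first, event_idx)
    return ranges

# Invented example: six detected events, four k-mers in the basecalled read.
event_means = [80.1, 79.8, 95.2, 101.0, 100.7, 88.3]
kmer_levels = [80.0, 95.0, 101.0, 88.0]
path = align_events_to_kmers(event_means, kmer_levels)
ranges = kmer_to_event_ranges(path, len(kmer_levels))
```

The inverted path plays the role of the old event table: for each position in the read it says which events, and therefore which raw samples, belong to it. A full-table DP like this is quadratic in read length, so a practical implementation on long reads would need to restrict the search, for example to a band around the diagonal.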
Nanopolish 0.8 is now available on GitHub. Please have a close look at the README, as the workflow for running nanopolish in raw mode is slightly different from 0.7. Previously you would prepare the reads by running nanopolish extract. With 0.8 you need to run nanopolish index to build a map from each read to its raw FAST5 file. As always, please open an issue if you run into trouble.