Aligning Nanopore Events to a Reference
Introduction
This post describes a new module I added to our nanopolish software package that aligns the signal data emitted by a nanopore to a reference genome. This is in contrast to most approaches which align two DNA sequences to each other (for example a base-called read and a reference genome). To make sense of what aligning signal data to a reference genome means, I will describe at a high level my model of how nanopore sequencing works. For a more detailed and technical description, see the supplement of our preprint.
Nanopore Sequencing
A nanopore sequencer threads a single strand of DNA through a pore embedded in a membrane. The pore allows electric current to flow from one side of the membrane to the other. As DNA transits this pore it partially blocks the flow of current, which is measured by the instrument. In Oxford Nanopore’s MinION system the measured current depends on the 5-mer that resides in the pore when the measurements are taken.
The MinION samples the current thousands of times per second; as 5-mers slide though the pore they should be observed in multiple samples. The MinION’s event detection software processes these samples and tries to detect points where the current level changes. These jumps indicate a new 5-mer resides in the pore. To help illustrate this I’ve reproduced a figure from our preprint below:
This is a simulation from an idealized nanopore sequencing process. The black dots represent the sampled current and the red lines indicate contiguous segments that make up the detected events. For example the mean current was around 60 picoamps, plus a bit of noise, for the first 0.5s. The current then dropped to 40 pA for 0.1s before jumping to 52 pA and so on.
The event detection software writes the events to an HDF5 file. The raw kHz samples are typically not stored as the output files would be impractically large. Here’s the table of events for this simulation:
event index | mean (pA) | length (s) |
---|---|---|
1 | 60.3 | 0.521 |
2 | 40.6 | 0.112 |
3 | 52.2 | 0.356 |
4 | 54.1 | 0.051 |
5 | 61.5 | 0.291 |
6 | 72.7 | 0.015 |
7 | 49.4 | 0.141 |
To help translate events into a DNA sequence, Oxford Nanopore provides a pore model which describes the expected current signal for each 5-mer. The pore model is a set of 1024 normal distributions - an example might look like this:
5-mer | \(\mu_k\) | \(\sigma_k\) |
---|---|---|
AAAAA | 53.5 | 1.3 |
AAAAC | 54.2 | 0.9 |
… | … | … |
TTTTG | 65.3 | 1.8 |
TTTTT | 67.1 | 1.4 |
This indicates that the measured current is expected to be drawn from \(\mathcal{N}(53.5, 1.3^2)\) when AAAAA is in the pore and so on.
Inferring Bases from Events
Using the pore model and the observed data we can solve a number of inference problems. For example we can infer the sequence of nucleotides that passed through the pore. This is the base calling problem. We can also infer the sequence of the genome given a set of overlapping reads. This is the consensus problem, which we addressed in our paper.
These inference problems are complicated by two important factors. First, the normal distributions for 5-mers overlap. There are 1024 different 5-mers but the signals typically range from about 40-70 pA. This can make it difficult to infer which 5-mer generated a particular event. This is partially mitigated by the structure of the data; a solution must respect the overlap between 5-mers so a position that is difficult to resolve may become clear when we look at subsequent events. Second, event detection is performed in real time and inevitably makes errors. Some events may not be detected if they are too short or if the signals for adjacent 5-mers of the DNA strand are very similar. The extreme case for the latter situation occurs when sequencing through long homopolymers - here we do not expect a detectable change in current. The opposite problem occurs as well. The event detector may split what should be a single event into multiple events due to noise in the system that looks like a change in current. Handling these artefacts is key to accurately inferring the DNA sequence that generated the events.
Aligning Events to a Reference
The hidden Markov model we designed for the consensus problem had 5-mers of a proposed consensus sequence as the backbone of the HMM, with additional states and transitions to handle the skipping/splitting artefacts. In our preprint we used this HMM to calculate a consensus sequence from a set of reads. If we make a reference genome the backbone of the HMM, we can use it to align events to the reference.
The new eventalign
module of nanopolish
exposes this functionality as a command line tool. This program takes in a set of nanopore reads aligned in base-space to a reference sequence (or draft genome assembly) and re-aligns the reads in event space.
The pipeline uses bwa mem
alignments as a guide. We start with a normal bwa workflow:
bwa mem -x ont2d -t 8 ecoli_k12.fasta reads.fa | samtools view -Sb - | samtools sort - alignments.sorted
samtools index alignments.sorted.bam
We can then realign in event space using nanopolish:
nanopolish eventalign -r reads.fa -b alignments.sorted.bam -g ecoli_k12.fasta "gi|556503834|ref|NC_000913.3|:10000-20000" > eventalign.tsv
The output looks like this:
contig position reference_kmer read_index strand event_index event_level_mean event_length model_kmer model_mean model_stdv
gi|556503834|ref|NC_000913.3| 10000 ATTGC 1 c 27470 50.57 0.022 ATTGC 50.58 1.02
gi|556503834|ref|NC_000913.3| 10001 TTGCG 1 c 27471 52.31 0.023 TTGCG 51.68 0.73
gi|556503834|ref|NC_000913.3| 10001 TTGCG 1 c 27472 53.05 0.056 TTGCG 51.68 0.73
gi|556503834|ref|NC_000913.3| 10001 TTGCG 1 c 27473 54.56 0.011 TTGCG 51.68 0.73
gi|556503834|ref|NC_000913.3| 10002 TGCGC 1 c 27474 65.56 0.012 TGCGC 66.96 2.91
gi|556503834|ref|NC_000913.3| 10002 TGCGC 1 c 27475 69.97 0.071 TGCGC 66.96 2.91
gi|556503834|ref|NC_000913.3| 10003 GCGCT 1 c 27476 67.11 0.017 GCGCT 68.08 2.20
gi|556503834|ref|NC_000913.3| 10004 CGCTG 1 c 27477 69.47 0.052 CGCTG 69.84 1.89
This is the complement strand (c) of a 2D Nanopore read from Nick’s E. coli data aligned to E. coli K12. The first event listed (event 27470) had a measured current level of 50.57 pA. It aligns to the reference 5-mer ATTGC at position 10,000 of the reference genome. The pore model indicates that events measured for 5-mer ATTGC should come from \(\mathcal{N}(50.58, 1.02^2)\), which matches the observed data very well. The next 3 events (27471, 27472, 27473) are all aligned to the same reference 5-mer (TTGCG) indicating that the event detector erroneously called 3 events where only one should have been emitted. Note that the current for these 3 events are all plausibly drawn from the expected distribution \(\mathcal{N}(51.68, 0.73^2)\).
This output has one row for every event. If a reference 5-mer was skipped, there will be a gap in the output where no signal was observed:
gi|556503834|ref|NC_000913.3| 10009 GCACC 1 c 27489 67.52 0.028 GCACC 66.83 2.46
gi|556503834|ref|NC_000913.3| 10011 ACCGC 1 c 27490 65.17 0.012 ACCGC 65.03 1.92
Here we did not observe an event for the 5-mer at position 10010.
This module will hopefully make it easier to work with signal-level nanopore data, and help the development of improved models. The eventalign
module can be found in the latest version of nanopolish.