As part of the work of the Pancancer variant-calling working
group, we needed to merge the
results of variant calls from a wide range of different packages to compare
their results and select interesting sites for lab validation. This is a
more subtle procedure than it sounds, and we could not find any one
place where all the necessary information was documented, so we wrote up
our process here. Code that implements this part of our analysis pipeline
can be found on GitHub.
Simple Variants - SNVs, Indels
The standard format used to output variants is the Variant Call Format. For SNVs and shortish indels (insertions or deletions), this works very well. Each entry in a VCF file contains the location of the variant (the chromosome it occurs on, and the starting position), the relevant excerpt of the reference sequence starting at that position, and the “alternate” sequence – the variant sequence that has been found there instead.
There are other fields that we will come back to; but a VCF file
containing an A → G SNV at chromosome 1, position 100, a
3 base-pair deletion at chromosome 2, position 200, and a 5 basepair
insertion at chromosome 3, position 300 would look something like this:
Even this admits some ambiguity; for instance, a deletion of 3 As within a
homopolymer run of 10 of them could reasonably be called at any of 8 positions;
and more complex substitions can be equally described as one large variant, or
as a combination of insertions, deletions, and substitions. To break these
ambiguities, there is a well understood
process, which requires looking at both the reference genome and the VCF file.
It is fairly straightforward, implemented by several
tools, and we perform this step (using bcftools
ingesting the submitted calls from various groups.
Once these sorts of calls are normalized, they are fairly easily merged and compared. We used python for this project, and we used Jared’s code from an earlier pilot for this – using a dictionary of dictionaries, where the first key was a (chromosome,start position) tuple. The value in the first dictionary corresponding to that key was then a dictionary of (reference allele, variant allele), with a value that was a list of callers that had made that call. So querying existance of a variant was just two (at most) dictionary lookups; and registering that a caller made a particular call was two dictionary lookups and a list append, possibly creating dictionaries and lists along the way if this was the first time these entries were seen.
Performing the same task with structural variants (henceforth SVs) was more complicated for two reasons:
Location - by nature of how SVs are detected, there is quite often uncertainty in the position of called SVs, making comparison more complicated; and
Description - there are a large number of ways that callers actually call SVs, and they need to be converted to some common representation for comparison.
Structural variants are generally observed differently than, say, SNVs. An alignment-based method might align a read to a reference, find that the alignment unambigiously implies a mismatch, and so call an SNV, as in the top panel of the figure below; there is no question about the location of the variant. (Obviously, to call the variant with any confidence would require multiple reads, not shown in the figure for clarity.)
However, a structural variant – say a translocation, as in the bottom panel of the figure – will likely be found differently. In the diagram, paired ends of a read map to two different chromosomes unambigiously, but there is ambiguity as to the location of the two breakpoints on the chromosomes which are now joined to form an new adjacency. Information from other reads will reduce the uncertainties, even to zero; but in the general case one has to handle ambiguity in location when deciding which calls to merge.
Many callers provided confidence intervals on locations of the breakpoints in their calls, which could be used for constraining the merging of calls. However, not all callers did; and there was, naturally enough, significant variation in the confidence intervals, even for a given variant. Thus we took a simpler approach, merging using a fixed window size. Two calls, where each breakpoint of one call was less than the window size away from the corresponding breakpoint of the other call, were merged.
We used a fixed window size of 300bp, of the order of the insert size observed
in the libraries; in principle, this is a more agressive merging strategy than using the confidence intervals when available (which were typically much less), but in practice, there were very few pairs of calls where the larger window size would have mattered.
The most efficient way to look up possible matching locations within some
window would be to use a spatial data structure such as an interval
tree; however, since our window
size was relatively modest, and we are constrained to integer positions, I
dealt with uncertainty in locations in location lookups by brute-force; I
simply search every integer position within the window. For efficiency, and to
ensure the closest match is found, the search is by distance (pos+0, pos±1, …). Because dictionary lookup in python is so fast, this approach is “fast enough” for our purposes, in that this is not the rate limiting step for the merging pipeline.
A bigger challenge – not conceptually, but in terms of bookkeeping and corner cases – is simply the diversity of ways in which calls are labelled in VCF files by various callers.
Breakpoint notation (SVTYPE=BND)
The VCF Standard describes two ways of describing SVs in a VCF file, of which there are numerous small variations out in the field. The second, described in somewhat more detail, is breakend notation (usually labelled with an SVTYPE=BND entry in the INFO field of a record). Before we go into this in much detail, let’s look at a figure describing the possibilties:
To describe a structural variant, we need more than just two positions, one per breakend. The breakends aren’t just points; they can be thought of as half-intervals, (eg, the piece of Chromosome 1 leading up to position 500; or the piece of Chromsome 1 starting at and continuing from position 500), and we will need that directional information.
(In the VCF spec, from the diagrams it is pretty clear that these are closed intervals; eg, regardless of the direction of the half-interval, the interval includes position 500. It is not 100% clear to me that every caller that uses BND notation honours that convention, but since we’re using 300-bp windows, such off-by-one errors in location need not concern us here.).
Once that is established, there is one more piece of information which need to be established to define the adjacency between breakends A and B; it can be thought of in two equivalent ways. One is the relative orientation; is half-interval A joined after half-interval B, or before? The second is the (relative) strandedness. Are the two half-intervals joined in such a way that the same strands are joined, or are the strands reversed, as in an inversion?
In VCF BND notation, for consistency with other entries but perhaps somewhat confusingly, only the chromosome and position of the first breakpoint are listed, and then in the ALT field is encoded:
The other chromosome and position of the other breakend (eg, 1:800 in the diagrams above)
The orientation of the second breakend; whether this is the half-interval which starts at the given position and extends rightwards, [1:800[, or extends from the left and ends at the position given, ]1:800].
Finally, the relative orientation of the first breakpoint and its direction is specified using the position of other bases relative to the second breakpoint.
In the examples in the figure, ‘N’ is given in the REF field; the position of that N relative to the ]1:800] or [1:800[ tells you whether the breakpoint involving 1:500 is joined before or after.
If after, it is the half-interval including that position and extending rightwards, other it is the breakpoint including that position extending to the left.
That combined with the relative orientation of the second interval is enough to establish the adjacency.
The N, of course, could be a real base, if known; and the corresponding bases listed in the ALT field could be one or more different bases, as would be the case if a substitution occurred at or insertion occured between the two half-intervals being joined at the new adjacency.
Note that the adjacency can be described equally well from either side; that is, for the purposes of identifying the new adjacency we could just as easily list 1:800 first:
1 500 . N N[1:800[
1 800 . N ]1:500]N
1 500 . N ]1:800]N
1 800 . N N[1:500[
1 500 . N [1:800[N
1 800 . N [1:500[N
1 500 . N N]1:800]
1 800 . N N]1:500]
When flipping the order of the breakpoints, we either do nothing else (in the case of adjacencies with different relative strandedness/orientation) or also flip position and direction (in the case of adjacencies with the same relative orientation). Otherwise, we end up describing a different novel adjacency involving the same two positions.
Because we want to merge equivalent adjacencies, and some callers call an adjacency from both sides and others just from one, we normalize the order by always listing the lexicographically first breakpoint position first.
Symbolic Notation Translocations (<TRA>)
The other way given for describing a structural variant is with “symbolic notation”; listing meaningful tags like <INV> to describe an inversion, or <DUP> to describe a duplication, in the ALT field of a VCF record.
The most general of these is the translocation symbolic call, <TRA>, which isn’t covered in the VCF standard, which essentially works the same way as a BND-style call. However, since the ALT field is now occupied, we need some other way to put the information about the position of the second breakpoint, and the relative orientation of the two half-intervals.
The other position is listed in the INFO field, with the chromosome given in an entry called CHR2, and the position in an entry called END. That leaves the relative orientation with which they are joined.
Although again this doesn’t appear to be covered in the VCF specification, the
convention is to use a “connection” field, CT, to indicate whether the 5’ or the 3’ end of the interval involving the first breakpoint position is connected to the 5’ or 3’ end of that involving the second. This gives the directions of both half-intervals and their connection, which is enough to define the adjacency.
We can then translate between BND notation and translocation calls:
<TRA> with CT INFO field
1 500 . N N[1:800[
1 500 . N <TRA> ... CHR2=1;END=800;CT='3to5'
1 500 . N ]1:800]N
1 500 . N <TRA> ... CHR2=1;END=800;CT='5to3'
1 500 . N [1:800[N
1 500 . N <TRA> ... CHR2=1;END=800;CT='5to5'
1 500 . N N]1:800]
1 500 . N <TRA> ... CHR2=1;END=800;CT='3to3'
Note again that the adjacencies can be described from either side. To flip these calls is fairly straightforward; the ends of the intervals (5’ or 3’) belong to their respective breakpoint positions, so flipping the breakpoints just means flipping the positions, and the first and second number in the CT record. So 3to3 and 5to5 remain the same, while 3to5 becomes 5to3 and vice versa.
Clearly, both calls and BND-style calls are equivalent; we output BND-style entries for conciseness, but either would work.
Higher-level Symbolic Calls (<DEL>, <INV>, <DUP>)
Higher-level calls have to be handled a little differently. Because some
callers may call them using BND-style calls, or even <TRA> calls, they have to be decompose into the lowest common denominator - individual adjacencies.
In the figure below are simple examples of a deletion, an inversion, and a duplication.
The deletion and duplication each generate only one novel adjacency (the 3to5 adjacency between Chr1:10 and Chr1:11 isn’t novel!), wheras the inversion generates 2.
Converting these into BND-style descriptions of the adjacencies follows below.
As BND call(s)
1 10 . N <DEL> ... END=20;
1 10 . N N[1:21[
1 10 . N <INV> ... END=20;
1 10 . N N]1:20]
1 11 . N [1:21[N
1 1 . N <DUP> ... END=10;
1 1 . N ]1:10]N
Breaking up calls like inversions into multiple calls proved important for understanding the comparison of results; we found several cases where some callers called an inversion, whereas others only identified one-half of the event.
While the above points cover most of the cases, the rather loose standards in this area cause a number of small bookkeeping headaches.
While callers often specify CT info fields for symbolic calls, they typically aren’t necessary (and are sometimes inconsistent with the intent of the caller, so suggesting an inverted duplication where a simple duplication is meant, or turning a deletion into a strange translocation, so we strip them out rather than interpreting them.
Similarly, some callers make BND-like calls but use SVCLASS info fields to label the call as a higher-level symbolic call, sometimes then leaving out other information which must then be inferred from the SVCLASS; these then have to be interpreted partly as BND-style calls and partly as symbolic calls.
We are consider structural variants to be equal for the purposes of merging if they represent a novel adjacency between two “equivalent” breakends. We do not distinguish between calls that (for instance) have insertions called at the new adjacency.
As with the simple variants, these variants are stored as a python dictionary-of-dictionaries. Each dictionary is a “locationdict”, taking as a key a (chromosome, position, direction/strand) pair, and querying a key will match any position within the windowsize of that location. The first key is the (lexicographically first) breakpoint location and direction/strand, and the second is the that of the second. The final value is a list of callers making the call, and their VCF records for comparison.
By including direction/strand information, we avoid the issue of merging two breakpoints with positions close but indicating opposite half-intervals. So for instance, in the inversion example above, ]1:20] and [1:21[ are clearly very close together in their starting locations; but they must not be merged, as they are entirely non-overlapping regions.
Identifying common SV calls between two different callers with different conventions for outputting VCF records isn’t a deep algorithmic problem, but lack of consistent documentation makes it more challenging than we had originally anticipated. We hope that this writeup helps others who are need to do something similar.
Last week, Titus Brown asked a question on twitter and github which spurred a lot of discussion – what was the best way to randomly read a subset of entries in a FASTQ file? He and others quickly mentioned the two basic approaches – randomly seeking around in the file, and streaming the file through and selecting at random – and there were discussions of the merits of each in both forums.
This workload is a great case study for looking some of the ins and outs of I/O performance in general, and the tradeoffs between streaming and random file access in particular. The results can be a little surprising, and the exact numbers will necessarily be file system dependant: but on hard drives (and even more so on most cluster file systems), seeks will perform surprisingly poorly compared to streaming reads (the “Reservoir” approach in the plot below):
and here we’ll talk about why.
Anatomy of a File System
Unlike so many other pieces of software we have to deal with daily, the file system stack just works, and so we normally don’t have to spend much time thinking about what happens under the hood; but some level of familiarity with the moving parts in there and how they interact helps us better understand when we can and can’t get good performance out of that machine.
IOPS vs Bandwidth
A hard drive is a physical machine with moving parts, and the file system software stack is built around this (even in situations where this might not make sense any more, like with SSDs - about which more later). Several metal platters are spinning at speeds from 7,200 to 15,000 revolutions per minute (a dizzying 120-500 revolutions per second), and to start any particular read or write operation requires the appropriate sector of the disk being under the read heads; an event that won’t happen, on average, until 1 to 4 milliseconds from now.
Both the drive controller hardware and the operating system work hard to maximize the efficiency of this physical system, re-arranging pending reads and writes in the queue to ensure that requests are processed as quickly as possible; this allows one read request to “jump the queue” if the sector it needs to read from is just about to arrive, rather than having it wait in line, possibly for more than one disk rotation. While this can greatly help the throughput of a large number of unrelated operations, it can’t do much to speed a single threaded program’s stream of reads or writes.
This means that, for physical media, there is a limit to the number of (say) read I/O Operations Per Second (IOPS) that can be performed; the bottleneck could be at the filesystem level, or the disk controller, but it is normally at the level of the individual hard drive, where the least can be done about it. As a result, even for quite good, new, hard drives, a typical performance might be say 250 IOPS.
On the other hand, once the sector is under the read head, a lot of data can be pulled in at once. New hard disks typically have block sizes of 4KB, and all of that data can be slurped in essentially instantly. A good hard disk and controller can easily provide sequential read rates (or bandwidth) of over 100MB/s.
Prefetching and Caching, or: Why is Bandwidth so good?
You, astute reader, will have noticed that the numbers in that sentence above don’t even come close to working out. 250 IOP/s times 4KB is something like 1 MB/s, not 100MB/s. Where does that extra factor of 100 come from?
Much as the operating system and disk controller both work to schedule reads and writes so that they are collectively completed as quickly as possible, the entire input/output stack on modern operating systems is built to make sure that it speeds up I/O whenever possible – and it is extremely successful at doing this, when the I/O behaves predictably. Predictable, here, means that there is a large degree of locality of reference in the access; either temporal locality (if you’ve accessed a piece of data recently, you’re likely to access it again), or spatial (if you’ve accessed a piece of data, you’re likely to access a nearby piece soon).
Temporal locality is handled through caching; data that is read in is kept handy in case it’s needed again. There may be block caches on the disk drive or disk controller itself; within the Linux kernel there is a unified cache which caches both low-level blocks and high level pages worth of data (the memory-mapped file interface ties directly into this). Directory information is cached there, too; you may have noticed that in doing an ls -l in a directory with a lot of files, the first is much slower than any followups.
In user space, I/O libraries often do their own caching, as well. C’s stdlib, for instance, will cache substantial amounts of recently used data; and so, by extension, will everything built upon it (iostreams in C++, or lines of data seen by Python). The various players are shown below in this diagram from IBM’s DeveloperWorks:
None of this caching directly helps us in our immediate problem, since we’re not intending to re-read a sequence again and again; we are picking a number of random entries to read. However, the entire mechanism used for caching recently used data can also be used for presenting data that the operating system and libraries thinks is going to be used soon. This is where the second locality comes in; spatial locality.
The Operating System and libraries make the reasonable assumption that if you are reading one block in a file, there’s an excellent chance that you’ll be reading the next block shortly afterwards. Since this is such a common scenario, and in fact one of the few workloads that can easily be predicted, the file system (at all levels) supports quite agressive prefetching, or read ahead. This basic idea – since reading is slow, try to read the next few things ahead of time, too – is so widely useful that it is used not just for data on disk, but for data in RAM, links by web browsers, etc.
To support this, the lowest levels of the file system (block device drivers, and even the disk and controller hardware) try to lay out sequential data on disk in such a way that when one block is read, the next block is immediately ready to be read, so that only one seek, one IOP, is necessary to begin the read, and then following reads happen more or less “for free”. The higher levels of the stack take advantage of this by explicitly requesting one or many pages worth of data whenever a read occurs, and presents that data in the cache as if it had already been used. Then this data can be accessed by user software without expensive I/O operations.
The effectiveness of this can be seen not only in the factor of 100 difference in streaming reads (100MB/s vs 4KB x 250 IOP/s), but also in how performance suffers when this isn’t possible. On a hard drive that is nearly full, a new file being written doesn’t have the luxury of being written out in nice contiguous chunks that can be read in as a stream. The disk is said to be fragmented, and defragmentation can often improve performance.
In summary, then, prefetching and caching performed by the disk, controller, operating system, and libraries can speed large streaming reads on hard disks by a factor of 100 over random seek-and-read patterns, to the extent that, on a typical hard drive, 100-400KB can be read in the time that it takes to perform a single seek. On these same hard drives, then, you might expect streaming through a single 1000MB file to take roughly as long (~10s) as 2,500-4,000 seeks. We’ll see later that considering other types of file systems - single SSDs, or clustered file systems - can change where that crossover point between number of seeks versus size of streaming read will occur, but the basic tradeoff remains.
The Random FASTA Entry Problem
To illustrate the performance of both a seeking and sequential streaming method, let’s consider a slightly simpler problem than posed. To avoid the complications with FASTQ, let’s consider a sizeable FASTA file (we take est.fa from HG19, slightly truncated for the purposes of some of our later tests). The final file is about 8,000,000 lines of text, containing some 6,444,875 records. We consider both compressed and uncompressed versions of the file.
We’ll randomly sample records from this file – 0.1%, 0.2%, 0.5%, 1%, 2%, 5%, 10%, 20%, and 50% of the total number of records – run for several different trials, and done a few different ways. We’ll consider a seeking solution and a sequential reading solution.
The code we’ll discuss below, and scripts to reproduce the results, can be found on GitHub. The code is in Python, for clarity.
A Seeking Solution
Coding up a solution to randomly sample the file by seeking to random locations is relatively straightforward. We generate a random set of offsets into the file, given the file’s size; then seek to each of these locations in order, find and read the next record, and continue.
There are a few things to note about this approach:
I’ve sorted the random locations to improve our chances of reading nearby locations sequentially, letting the operating system help us when it can.
The file size must be known before generating the locations. This means we must have access to the whole file; we can’t just pipe a file through this method
Some compression methods become difficult to deal with; one can’t just move to a random location in a gzipped file, for instance, and start reading. Other compression methods - bgzip, for instance, make things a little easier but still tricky.
The method is not completely uniformly random if the records are of unequal length; we are more likely to land in the middle of a large record than a small one, so this method is biased in favour of records following a large one.
Because we are randomly selecting locations, we may end up choosing the same record more than once; this gets more likely as the fraction of records we are reading increases. In this case, we go back and randomly sample records to make up the difference. There’s no way to know in general if two locations indicate the same record without reading the file at those locations.
A Streaming Solution - Reservoir Sampling
A solution to the -random-sampling problem which makes use of streaming input is the Reservoir Sampling approach. This elegant method not only takes advantage of streaming support in the file system, but doesn’t require knowledge of the file size ahead of time; it is a so-called ‘online’ method.
The basic idea is that, for every th item seen, it is selected with a probability of . Because there’s a chance of it being bumped by the next item, then the probability of the th item being selected by the end of round is
and so on until by the time all items are read, each item has a chance of being selected.
Our simple implementation follows; we select the first items to fill the reservoir, and then randomly select through the rest of the file.
Note that this method does uniformly select records, and can work in pipes or through compressed files equally well as it passes sequentially through the entire file.
Timing Results: Workstation Hard Drive
I ran these benchmarks on my desktop workstation, with a mechanical hard drive. As a quick benchmark for streaming, simply counting the number of lines of the uncompressed (a shade under 4GB) file takes about 25 seconds on this system, which gives us a sense of the best possible streaming time for the file; this is about 160MB/s, a reasonable result (and in fact slightly higher than I would have expected). Similarly, if we expected an IOPS rate of about 400, then we’d expect to see 0.1% selection to take about 6445/400 ~ 16s.
The file handling and parsing will be significantly slower in python than it would be in C++, which disadvantages the streaming approach (which must process many more records than the seeking approach) somewhat, but our results should be instructive regardless.
The primary results are shown in this plot, which we have already seen (note that both x and y scales are logarithmic):
Some basic takeaways:
The reservoir approach, which always has to pass through the entire file, is much less variable in time than the seeking approach.
For sampling less than 0.75% of the file, seeking is clearly and reliably faster; for greater sampling fraction, seeking may or may not be faster
At very large fractions, the seeking time blows up as the chance of “collisions” - selecting the same entry multiple times - greatly increases, meaning you have to go back and resample. But no one would really suggest this approach for sampling more than 10% of the file anyway.
Reservoir sampling works roughly equally well if it is operating directly on the file or having it piped through; and it actually can be somewhat faster for a gzipped file with zcat, since less data actually has to be pulled from the disk.
We also tried the same test on gzipped files directly, since Python’s gzipped file access has a seek operation you can in principle use; but this isn’t really a fair test, as you can’t properly seek through a gzipped file, you have to decompress along the way. That means the “seeking” approach is really just a streaming approach implemented much less efficiently, and we see that quite clearly:
It was because the file was truncated that we could use gzip with seek-based sampling here at all; seek-sampling requires knowing the filesize, and the total (uncompressed) file size isn’t available with a gzipped file unless the uncompressed size is less than 4GB.
Sidebar - Benchmark warning: Clear that cache!
Note that for benchmarking I/O, even for moderately large files like this one (~1+GB compressed, 4GB uncompressed), a significant amount of the file will remain in various levels of OS cache, so it is absolutely essential to clear or avoid the cache in subsequent runs or else your timing results will be completely wrong.
On linux, as root, you can completely clear caches with the following:
but this is rather overkill, and requires root. Easier, but requiring more disk space (and a file system which is not too smart/aggressive about deduplication!) is to cycle between multiple copies of the same file. The getdata script makes 5 copies each of the compressed and uncompressed est.trunc.fa file to cycle through, which may or may not be enough.
Other File Stores: Cluster File Systems
Of course, single spinning disk performance isn’t all that a bioinformatician cares about. Two other types of file systems play a large role; file systems on shared clusters, and individual SSDs.
On a cluster file system, data is stored on a large array of spinning disks. This has performance advantages, and disadvantages.
On the plus side, file systems like Lustre will often stripe large files across several disks and even servers, so that streaming reads can be served from many pieces of hardware at once – potentially offering many multiples of 100MB/s of streaming bandwidth. Similarly, the data file being striped across multiple disks means that many multiples of the IOPS are in principle available.
In practice, because so many parts (servers, disks) need to be coordinated to perform operations, there is often an additional latency to file system operations; this tends to come through as modestly fewer effective IOPS than one would expect. This means that the IOPS, while still potentially much higher than a local disk, are proportionally less increased than the bandwidth, tilting the balance in favour of streaming over seeking.
On the other hand, having a network-attached file system introduces another potential bottleneck; a slow or congested network may mean that the peak bandwidth available for streaming reads at the reader may be decreased, pushing the balance back towards seeking.
Other File Stores: SSDs
SSDs – which are ubiquitous in laptops and increasingly common on workstations – change things quite a bit. These solid state devices have no moving parts, meaning that there is no delay waiting for media to move to the right location. As a result, IOPS on these devices can be significantly higher. Indeed, traditional disk controllers and drivers become the bottleneck; a consumer-grade device plugged in as a disk will still be limited to 500MB/s and say 20k IOPS, while specialized devices that look more directly like external memory can achieve much higher speeds. (For those who want to know more about SSDs, Lee Hutchinson has an epic and accessible discussion of how SSDs work on Ars Technica; the article is from 2012 but very little fundamental has changed in the intervening three years).
At those rates, both streaming and seeking workflows see a performance boost, but the increase is much higher for IOPS. Rather than streaming a 1000MB file taking roughly as long as 2,500-4,000 seeks, it is now more like 40,000 seeks. That’s still finite, and each seek still takes roughly as much time as reading 25KB of data; but that factor of ten difference in relative rates will change the balance between whether streaming or seeking is most efficient for any given problem.
Running this same test on my laptop gives results as shown below:
We see that the laptop-grade hardware limits the performance of the streaming read; bandwidths (and thus the performance of the reservoir sampling) are down by about a factor of 2. On the other hand, we seem to have gained over a factor of 10 in IOPS, with approximately 3000 effective random reading IOPS. As a result, the seeking for a 0.1% sampling fraction takes a lightning-fast 2.5 seconds.
However, it’s worth noticing that, even with the decrease in bandwidth and startling increase in IOPS, the crossover point between where streaming wins over seeking has only shifted from 0.75% to 3%; beyond that, streaming is clearly the winner.
How to further improve seeking results
Mechanical hard drives will always be at a significant disadvantage for random-access workloads compared to SSDs. While SSDs are significantly more expensive than mechanical HDs for the same capacity, the increase in performance for these workloads (and their lower power draw for laptops) may make them a worthwhile investment for use cases where seeky access can’t be avoided.
It’s also possible to improve the performance of the seeky workload through software. As mentioned before, the file system OS layer and physical layer are highly concurrent, juggling many requests at once and shuffling the order of requests behind the scenes to maximize throughput. For a highly seeky workflow like this, it’s often possible to make use of this concurrency by launching multiple threads, each sending their read request at the same time, and waiting until completion before launching the next. This greatly increases the chance of finding a read request which can be performed quickly, making fuller use of the disk subsystem. This significantly increases the complexity of the user software, however, and I won’t attempt it for the purposes of this post.
Software: turning off OS caching
A smaller possible gain could be realized, for small sample fractions, by hinting to the operating system not to provide expensive caching that will not be used by the seek-heavy access pattern. This can be done by opening the file with O_DIRECT, or using posix_fadvise which allows a more flexible method for hinting to the operating system not to bother prefetching or caching, respectively, by passing POSIX_FADV_RANDOM and POSIX_FADV_NOREUSE. However, this is likely only helpful for very small sample fractions, where seeking is already doing pretty well; for moderate sample fractions, the prefetching can actually help (e.g., that downward trend in time taken at around 10%) so I did not include this in the benchmark.
How to further improve sequential results
Workstation-class SSDs, with appropriate controllers, also offer a significant
increase in streaming bandwidth over their mechanical counterparts, even if the
increase is proportionally less than that in IOPS. 4-5x increases are not
uncommon, and those would benefit the reservoir method here.
Software: Faster parsing
While Python is excellent for many purposes, there is no question but that it is
slower than compiled languages like C++. FASTA parsing is quite simple, and
for very small sampling fractions, there is no good reason that the resevoir solution should be a
factor of two or more slower than running wc -l. This hurts the reservoir
sampling more than the streaming, as the resevoir approach (which parses records which then
get bumped later) must parse ~17 times more records in this test than the seeking method.
Software: turning off OS caching
While prefetching is essential for the streaming performance we have seen,
there may be some modest benefit to turning off caching of the data we
read in; after all, even with the reservoir sampling, we are still only
processing each record at most once. Again, we could use
this time with only POSIX_FADV_NOREUSE. I again expect this to be a
relatively small effect, and so it is not tested here.
Conclusion: I/O is Complicated, But Streaming is Pretty Fast
This post only scratches the surface of I/O performance considerations. While we’ve thought a little bit about seeking vs sequential reading IOPS and bandwidth, even just streaming writes has different considerations (on the one hand, you can write anywhere that’s free, as opposed to needing to read specific bytes; on the other hand, there’s nothing like ‘prefetching’ for reads). More complex operations – and especially metadata-heavy operations, like creating, deleting, or even just appending to files – involve even more moving parts.
And even for seeking vs streaming reads, while the trends we’ve discussed here are widely applicable, different systems – underlying hardware (disk vs ssd) or how the file system is accessed (network-attached vs local) can greatly change the underlying numbers, which change the tradeoffs between seeking and streaming, possibly enough to make the conclusions different for any particular use case. We are empiricists, and the best way to find out which approach works best for your problem is to measure on the system that you use – fully aware that anyone else who uses your software might do so on different systems or for different-sized problems.
But a lot of software and hardware infrastructure is tuned to make streaming reads from filesystems as fast as possible, and it’s always worth testing to see if streaming through the data really isn’t fast enough.
Jared’s nanopolish tool for
Nanopore data uses poaV2,
the original partial order alignment software described in papers by
Lee, Grasso, and Sharlow 1,2,3, for
correcting the reads, following a similar approach taken by PacBio in
This post gives a quick lower-level overview of the steps in the POA
algorithm, with a simple implementation in python
to demonstrate the ideas more concretely.
The insight of the first POA paper was that “flattening” of
the alignment of sequences leads to meaningless artifacts that, while
largely harmless for pairwise alignments or even multiple alignments
of strongly conserved sequences, causes problems with more general
multiple alignments. For instance, consider the following sequences:
There is ambiguity in selecting a single, best alignment between
this pair of sequences; for instance below are 4 of
28 = 256
8 choose 4 = 105 nearly equivalent
ways of expressing this pairwise alignment. The best alignment will
depend on the particular gap-scoring scheme used.
While for a pairwise alignment this is comparatively harmless, as
additional sequences are added to form a multiple sequence alignment
(MSA), the choice between these ambiguities begin to distort the eventual
result. What we would like is to consider not necessarily a single linear
layout, but something that can express more unambiguously “one
sequence inserts a run of A, and the other of T”. And a natural way
to view that is with a graph:
The partial order alignment graph differs from the alignment strings
in that a given base can have multiple predecessors (eg, the C
after the fork being preceeded by both a string of As and of Ts)
or successors (eg, the C before the fork). But it is similar to
the alignment strings in that there is a directional order imposed,
both in the sense that each node has (zero or more) predecessors and
(zero or more) successors, but also that no repetition, or doubling
back, is allowed; the graph is constrained to be a Directed, Acyclic
Both repeats and re-orderings can be biologically relevant, and various
generalizations of alignment have allowed this 4,5.
This greatly generalizes the problem, moving it closer to assembly.
For the purposes of error correction in nanopolish, that additional
generalization is not needed.
To consider how alignment to a graph works, let ’s remind ourselves of
how we perform alignment on sequences.
algorithm and its variants, we consider two cursors - one on a base in
each sequence. For each pair of cursor positions in turn, we consider
the question of “what is the best sequence of alignments and insertions
that could lead to this position in the alignment”. Because the globally
optimal path must be made from locally optimal “moves” (that is, the
“principle of optimality” holds for this problem), this reduces to finding
out which of the three possible moves that would advance the cursors to
this position to choose from:
Both cursors having advanced, aligning (matching) these two bases;
Cursor 1 had advanced while cursor 2 remained fixed, inserting that base from sequence 1 into the alignment
Vice versa, with cursor 2 advancing and cursor 1 staying fixed.
A familiar diagram follows below; of those three possible moves, we take the
running scores from each of those previous positions, add the score
corresponding to the move, and set the score of the current position.
We can calculate the scores for pairs of positions in any order we like
– along rows of the matrix, columns, or minor diagonals –
as long as for any position we calculate, the scores for the previous
positions we need have already been calculated.
String to Graph Alignment
Aligning a sequence to a DAG introduces suprisingly little complexity to the
dynamic programming problem; the clever diagram in the POA paper with a dynamic
programming matrix with 3D “bumps” may have had the unintended consequence of
making it look more complicated than it is.
The primary difference for the purposes of dynamic programming is
that while a base in a sequence has exactly one predecessor, a base
in a graph can have two or more. Thus, the cursor may have come from
one of several previous locations for the same (graph) Insert or Align
moves being considered; and thus those scores must be considered too in
determining the best previous position. (Note that insertions from the
sequence are unchanged).
So, to reiterate: the only difference deep inside the dynamic programming
loop is that multiple previous scores (and any associated gap-open
information) must be considered for insertions or alignments of the
graph base. This is implemented by a loop over predecessors for the current
base, and all else remains the same.
There is one step that has to happen before that dynamic programming loop,
When aligning two sequences, one could choose an order to loop over the
sequence indices before hand so that, for any new position being calculated,
the necessary previous scores would already be ready.
The nodes in the graph, however, do not have such a useful intrinsic
order. If the nodes are considered in the order they are added, for
instance, then the newest nodes inserted with a new sequence –
which may have been inserted as predecessors of nodes that had been
inserted earlier – will not have been already scored when their
successor begins its calculation.
The answer is to use a Topological Sort to generate an
ordering of nodes in which every node is guaranteed to follow all of
its predecessors. This is always possible for a directed graph as long
as there are no cycles, and indeed there can be many such orderings.
Topological sorts are how make and similar tools decide in which order
to perform tasks in a workflow, and how many6 spreadsheet programs decide
if cells need to be updated.
There are two main classes of algorithms for performing topological sorts; the
algorithm of Kahn (1962), and a repeated depth-first search. Either serves
perfectly well for the dynamic programming problem.
So to align a sequence to a graph, the steps are simply:
Perform a topological sort if the graph has been updated
Do the dynamic programming step as usual, with:
The graph nodes visited in the order of the topological sort, and
Considering all valid predecessors for align/insert moves.
Insertion of aligned sequence
Consider that we have a graph that so far only contains the sequence CGCTTAT,
and our dynamic programming calculation aligning the sequence CGATTACG has
given us an alignment that looks like this:
That is, for each base in the sequence, it is paired (either as match or
mismatch) with a base in the graph, or it is inserted.
We expect inserting the new sequence into the graph to give us something like:
Here we see for the first time two types of edges; bold, directed edges
(with directions not shown, but left-to-right), indicating
predecessor/successor; and dashed lines, indicating that (say) the A and C
that are three bases from the start are aligned to each other, but are
mismatches; similarly with the C and T towards the end.
We keep track of both the predecessor/successor nodes and all ‘aligned-to’
nodes. We walk along the sequence we are inserting and its calculated
alignment. We insert nodes in the sequence if they are not aligned to
anything, or none of the nodes that it directly or indirectly aligns to have
the same base; otherwise, we re-use that node and simply add new edges to it if
In more detail, the steps we take are as follows:
A new “starting point” for this sequence is created in the graph.
The previous position is set to this starting point.
For each sequence base in the calculated alignment,
If the current base is not aligned to a node in the graph, or if it is but neither the node nor any node it is aligned to has the same base,
A new node is created with the sequence base, and is selected as the current node
This new node is aligned to the aligned node if any, and all of the “aligned-to” nodes are updated to align to this one.
That node with the same base is selected as the current node
If one does not already exist, a new edge is added from the previous position to the current node
That edge has the current sequence label added to it; the number of labels on the edge correspond to the number of sequences that include that edge and those two nodes.
Fusing nodes whenever possible ensures that information about a motif that
several times in several sequences in a similar location is not obscured
by corresponding to several paths through the graph; It also increases
the runtime of the algorithm by limiting the number of nodes and edges that
need to be considered.
Note that one can always reconstruct any individual sequence inserted into the
graph by looking up its starting point, and following edges labelled with the
corresponding label through the graph.
Once an aligned sequence is inserted, a new topological sort of the nodes is
generated, and another alignment can be perfomed.
Now that you have all of your sequences in the graph, how do you get things
like a consensus sequence out of it? This is the topic of a paper2 separate
from the first one.
Finding the single best-supported traversal through the graph is relatively
straightforward. In fact, this is again a dynamic programming problem; one
sets the scores of all nodes to zero, and then marches through the graph node
by node. At each node, one chooses the “best” edge into that node – the
one with the most sequences including it – and sets the score to be
the edge weight plus the score of the node pointed to; and in case of a tie
between edges, one chooses the one pointing to the highest-scoring node.
The highest score and the edges chosen gives you a maximum-weighted path
through the graph. As is pointed out in the consensus paper, this is
a maximum-likelihood path if the edge weights correspond to the probabilities
that the edge is followed.
However, there may well be multiple consensus features in the alignment that
one wishes to extract; a feature seen by multiple but still a minority of
sequences. The approach to finding remaining consenses is necessarily somewhat
heuristic, and comprises the bulk of the consensus paper.
The basic idea is to somehow remove or downweight the edges that
correspond to the already-extracted consenses, and repeat the procedure
to find additional features. The steps recommended in the consensus paper are:
Identify sequences that correspond to the consensus just identified; by (eg) fraction of their bases/edges included, possibly with other requirements
For edges corresponding to those sequences, reduce the weight corresponding to those sequences, possibly to zero
Rerun the consensus algorithm.
In the simple implementation we use to demonstrate these ideas, we simply
choose all (remaining) sequences that have a majority of their bases
represented in the current consensus sequence, remove the corresponding weight
of those edges entirely, and repeat until no further sequences remain or no
significant consensus sequence is found.
The consensus paper identifies a particular corner case where a consensus
sequence might terminate early; we allow this to happen.
Finally, to communicate the alignment results, it can still be useful
to generate a “flattened” alignment of the input and consensus sequences.
This is again fairly straightforwardly done once the graph is topologically
sorted. Each node in the graph, in topological order, is assigned a column
in the final table to be generated, with rings of nodes that are aligned to
each other assigned to the same column, and nodes that are not aligned to any
others getting their own column. Then the bases are filled in, with each
sequence (including the consensus sequences) getting their own row.
Because we are assigning columns to the nodes in topologically-sorted order,
the method used to generate the (non-unique) topological sort affects how
the alignments look as alignment strings, even if they are all functionally
identical. Kahn sorting tends to interleave the results of sequences, whereas
depth-first-search necessarily visits long strings of runs in order. DFS
then generates better looking alignment strings, so we use that approach
in the implementation below.
A simple but fully functional Python implementation of the algorithms
described above can be found here.
For the alignment stage, two implementations are given; one
that is quite simple to follow but is very slow; and another
that is significantly faster, but may require a little more careful
reading, as it uses numpy vectorization to improve performance.
Even the faster implementation is still slow – about 10 times slower
than the poaV2 code written
in C as distributed, or closer to 20 if poaV2 is compiled with -O3
– but is nonetheless useable for small problems.
The simple implementation above can generate HTML with an interactive
graph visualization to explore the final partial order graph; the
visualization works particularly well on browsers with a high-performance
than a thousand nodes or so.
Partial order alignment is a powerful technique that results in a
graph containing rich information concerning the structure of the
aligned sequences, but lacks the amount of online documentation and
easy-to-explore implementations of some other methods; we hope this
helps introduce a broader audience to a more in-depth understanding
of the method.
This post describes a new module I added to our nanopolish software package that aligns the signal data emitted by a nanopore to a reference genome. This is in contrast to most approaches which align two DNA sequences to each other (for example a base-called read and a reference genome). To make sense of what aligning signal data to a reference genome means, I will describe at a high level my model of how nanopore sequencing works. For a more detailed and technical description, see the supplement of our preprint.
A nanopore sequencer threads a single strand of DNA through a pore embedded in a membrane. The pore allows electric current to flow from one side of the membrane to the other. As DNA transits this pore it partially blocks the flow of current, which is measured by the instrument. In Oxford Nanopore’s MinION system the measured current depends on the 5-mer that resides in the pore when the measurements are taken.
The MinION samples the current thousands of times per second; as 5-mers slide though the pore they should be observed in multiple samples. The MinION’s event detection software processes these samples and tries to detect points where the current level changes. These jumps indicate a new 5-mer resides in the pore. To help illustrate this I’ve reproduced a figure from our preprint below:
This is a simulation from an idealized nanopore sequencing process. The black dots represent the sampled current and the red lines indicate contiguous segments that make up the detected events. For example the mean current was around 60 picoamps, plus a bit of noise, for the first 0.5s. The current then dropped to 40 pA for 0.1s before jumping to 52 pA and so on.
The event detection software writes the events to an HDF5 file. The raw kHz samples are typically not stored as the output files would be impractically large. Here’s the table of events for this simulation:
To help translate events into a DNA sequence, Oxford Nanopore provides a pore model which describes the expected current signal for each 5-mer. The pore model is a set of 1024 normal distributions - an example might look like this:
This indicates that the measured current is expected to be drawn from when AAAAA is in the pore and so on.
Inferring Bases from Events
Using the pore model and the observed data we can solve a number of inference problems. For example we can infer the sequence of nucleotides that passed through the pore. This is the base calling problem. We can also infer the sequence of the genome given a set of overlapping reads. This is the consensus problem, which we addressed in our paper.
These inference problems are complicated by two important factors. First, the normal distributions for 5-mers overlap. There are 1024 different 5-mers but the signals typically range from about 40-70 pA. This can make it difficult to infer which 5-mer generated a particular event. This is partially mitigated by the structure of the data; a solution must respect the overlap between 5-mers so a position that is difficult to resolve may become clear when we look at subsequent events. Second, event detection is performed in real time and inevitably makes errors. Some events may not be detected if they are too short or if the signals for adjacent 5-mers of the DNA strand are very similar. The extreme case for the latter situation occurs when sequencing through long homopolymers - here we do not expect a detectable change in current. The opposite problem occurs as well. The event detector may split what should be a single event into multiple events due to noise in the system that looks like a change in current. Handling these artefacts is key to accurately inferring the DNA sequence that generated the events.
Aligning Events to a Reference
The hidden Markov model we designed for the consensus problem had 5-mers of a proposed consensus sequence as the backbone of the HMM, with additional states and transitions to handle the skipping/splitting artefacts. In our preprint we used this HMM to calculate a consensus sequence from a set of reads. If we make a reference genome the backbone of the HMM, we can use it to align events to the reference.
The new eventalign module of nanopolish exposes this functionality as a command line tool. This program takes in a set of nanopore reads aligned in base-space to a reference sequence (or draft genome assembly) and re-aligns the reads in event space.
The pipeline uses bwa mem alignments as a guide. We start with a normal bwa workflow:
This is the complement strand (c) of a 2D Nanopore read from Nick’s E. coli data aligned to E. coli K12. The first event listed (event 27470) had a measured current level of 50.57 pA. It aligns to the reference 5-mer ATTGC at position 10,000 of the reference genome. The pore model indicates that events measured for 5-mer ATTGC should come from , which matches the observed data very well. The next 3 events (27471, 27472, 27473) are all aligned to the same reference 5-mer (TTGCG) indicating that the event detector erroneously called 3 events where only one should have been emitted. Note that the current for these 3 events are all plausibly drawn from the expected distribution .
This output has one row for every event. If a reference 5-mer was skipped, there will be a gap in the output where no signal was observed:
This post describes changes I have made to nanopolish, our HMM-based consensus caller for Oxford Nanopore data. This post can be thought of as a long changelog with background and rationale.
Background and History
Nick Loman, Josh Quick and I started working on nanpore assembly at a hackathon at the Newton Institute in Cambridge. Our initial goal was pretty simple; we wanted to see if we could run DALIGNER on nanopore data and devise a way to error correct the reads. After a lot of tinkering and “proper bioinformatics” as Nick put it (converting file formats) we were able to run poa on the overlapping reads that DALIGNER found. Taking poa’s consensus sequence as the error corrected read improved identity to around 92-93%. Nick was able to get Celera Assembler running on the corrected reads and our assembly became progressively better as Nick and Josh added more data.
Once Nick got a single-contig assembly out of Celera Assembler we turned our attention to improving the accuracy of the final assembly. The consensus sequence that Celera Assembler called off the corrected reads had accuracy of about 98%. We knew that by working with the base-called reads, rather than the raw signal data emitted by the nanopore, we were losing a lot of information. During the winter holidays I started to write code that would use the raw current signal to call a new consensus. My initial exploratory code was in Python as the poretools package gave convenient access to the raw signal data encoded in ONT’s FAST5 files. I wrote a quick hidden Markov model in Python to calculate the probability of observing a sequence of nanopore signals given an arbitrary sequence. I immediately realized my Python HMM would be far too slow to run on even a bacterial genome so I decided the core algorithms would have to be written in C or C++.
I asked on twitter the best way to call out to a C++ library from Python and received many helpful replies (h/t to Titus Brown, Michael Crusoe and others). I settled on using ctypes as this bridge between the Python frontend/poretools and the HMM in C++. I was surprised at how easy ctypes makes this - I had Python talking to a prototype C++ library in under an hour. This hybrid Python/C++ solution was just fast enough to make model development and testing possible. We spent the next month or so revising the probabilistic model of the data, developing algorithms to propose candidate consensus sequences and testing them on our E. coli data. Once the model settled we ran it on the single-contig assembly which took a few days running in parallel on Nick’s server. We wrote up a preprint describing this work and posted it on BioRxiv.
Improving the design
I was not satisfied with the Python/C++ hybrid design. I am sensitive to installation issues when releasing software as I have found that installing dependencies is a major source of problems for the user (although great projects like homebrew-science are helping a lot here). I admire Heng Li’s software where one usually just needs to run git clone and make to build the program. The initial version of nanopolish was far from this ideal as it depended on eight Python libraries that needed to be installed with pip. When moving between my local development version and Nick’s server I realized that installing these dependencies often failed. With this in mind I decided to rewrite the Python frontend in C++. To do this, I crucially needed a replacement for poretools which I used to access the raw data. Matei David in my group volunteered to help and wrote an excellent, intuitive C++ library for parsing ONT’s FAST5 files.
There were additional benefits to this rewrite. In the Python version I again used poa to compute an initial multiple alignment and used this to seed the HMM realignment. In the C++ version I discarded this step, removing another dependency, by calculating the initial alignment directly from the BAM file using htslib. This simplification, along with the much faster parsing of the FAST5 files provided by Matei’s library, reduced startup time from a few minutes to a few seconds. This has helped me iterate on the code much faster during development.
Improving HMM efficiency
Despite writing the HMM in C++ the first version of nanopolish was very slow. After the Python to C++ rewrite I focused on improving run time. During development I use a lightweight header-only profiler to keep track of where time is being spent in my program. As expected over 90% of the time was spent running the forward algorithm on the hidden Markov model. I used the amazing perf kernel profiler to explore this further. perf indicated that most time was in the and functions. The forward algorithm on HMMs requires summing log-transformed probabilities. The naive way, , requires two calls to and one call to for every state/emission pair in the HMM. This is very slow and the subject of an entire section in the classic Biological Sequence Analysis text. I remembered reading Sean Eddy’s paper on accelerating HMMER3. In this paper Sean describes how the calculation can be improved by using the transformation where . On the surface this would only save a single call to but Sean goes further by using a table indexed by to cache . This method completely removes the and calls in the inner loop of our HMM. After plugging Sean’s implementation into nanopolish we immediately had an 8-fold improvement in speed. Thanks to Sean for allowing us to use this as public domain code.
Improving memory layout
The perf output also indicated that a lot of CPU time was wasted waiting on memory access. To improve cache usage and reduce the amount of data that is transferred over the memory bus, I reduced the precision of the floating point values from 64 bits to 32 bits. At the same time, I changed the memory layout of the nanopore event signals so that data accessed together was located in contiguous memory locations. This change simply interleaved two arrays, one storing event currents and one storing event durations, into a single array of structs. Finally, I pre-computed all of the scalings and transformations that need to be applied to the nanopore events (for example the current signal drifts over time and this needs to be corrected for) to reduce the work in the inner loop of the HMM.
This collection of changes reduced the average time spent in the forward algorithm of the HMM from per call to per call for 100 input events and a 100bp sequence, an improvment of over 10x. This did not require any changes at the algorithm level, only minor code optimizations. My next goal is to make algorithmic improvements, primarily avoiding testing unlikely candidate sequences in the HMM. Version 0.2.0 of nanopolish is here.