New Sequencing X-Prize


A new X-prize has been announced, and this one is right up my alley. They're offering 10 million dollars (and an enormous amount of publicity) to the first team to decode the genes of 100 people in 10 days or less. Unfortunately, the announcement is fairly unclear as to what they mean by 'decode'. For the purposes of this discussion, I'll assume that it means sequence and assemble the patient's complete DNA and annotate all known genes and variations within.

First, some background: The process of 'decoding' that they describe has three major steps:

Step 1: Sequencing
This involves getting the raw data out of a person's DNA - strings of base pairs that look something like this: AACGTCTAAAGATA. Sequencing is typically done using a technique called shotgun sequencing. Your DNA (all 3 billion base pairs of it) is broken up into small chunks, and the pieces are sequenced. The currently used method, Sanger sequencing, is fairly slow, which is why Baylor has three floors devoted to the genome center - it requires a lot of machines to get decent throughput. This technology has been around for decades, but may not last much longer. The new kid on the block is pyrosequencing, pioneered by 454 Life Sciences. Their process is a lot cheaper (2-10x) and a lot faster (100x), but results in much shorter read lengths.

Step 2: Assembly
A convenient way to think about assembly is that it's like doing a giant jigsaw puzzle. After sequencing, what we have is a collection of millions of fragments of DNA. Using complicated computer algorithms, we're able to sort through all these pieces and try to put it all back together again. It's a lengthy process, complicated by a variety of things like repeat sequences (identical puzzle pieces). Using pyrosequencing, which gives us much shorter reads, means that we're doing a puzzle that is much more complicated, because it has many more pieces, and each piece is 7 times smaller.
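
To make the jigsaw analogy concrete, here's a toy sketch of the idea: find the pair of fragments with the longest overlap, glue them together, and repeat. This is a simple greedy approach chosen for illustration, not what a production assembler does, and the function names and the `min_len` cutoff are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the two reads with the longest overlap.

    Toy illustration only: real assemblers deal with sequencing errors,
    reverse complements, and repeats, none of which are handled here.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping fragments of the example string from Step 1
fragments = ["AACGTCT", "GTCTAAAG", "AAAGATA"]
print(greedy_assemble(fragments))  # ['AACGTCTAAAGATA']
```

With only three clean fragments this works nicely. Shorter reads mean shorter overlaps, and once a repeat is longer than a read, several pieces fit equally well in the same spot, which is exactly the identical-puzzle-pieces problem.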

Step 3: Annotation
After getting our sequence assembled, we still have to go back and figure out which parts are genes. Then, we have to take a detailed look at each of these genes to identify mutations. While many mutations are harmless, there are lots of others that cause disease states. For obvious reasons, we'd like to know where these are.
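
As a very rough illustration of the "identify mutations" part, here's a sketch that compares a sample sequence against a reference and lists the single-base differences. It assumes the two sequences are already aligned and the same length, which is a big simplification (no insertions or deletions), and the function name is made up for this example.

```python
def find_substitutions(reference: str, sample: str):
    """List (position, reference_base, sample_base) for each mismatch.

    Positions are 1-based. Assumes pre-aligned, equal-length sequences.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length, pre-aligned sequences")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


reference = "AACGTCTAAAGATA"
sample = "AACGTCTAAGGATA"  # same string with one base changed
print(find_substitutions(reference, sample))  # [(10, 'A', 'G')]
```

Finding the difference is the easy part. Deciding whether a given change is harmless or linked to disease is where the real annotation work lies.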

Doing the math

  • A human genome is about 3 billion base pairs long.
  • Pyrosequencing gives us about 67Mbp per hour.
  • With 454, 8x oversampling only gives you 95% coverage of the unique portions of the genome, so I'll guesstimate and say we'll have to double the amount of sampling (to 16x) to get 3-fold coverage near 99.9% completeness.
  • I'll assume that you'll need to use half the time to sequence, and the other half to assemble/annotate.

So, in order to meet this goal, using current technology, you'd need something like 600 pyrosequencing machines. Oh, and have I mentioned that they cost a half-million dollars apiece? And that this doesn't include the costs of reagents and manpower??
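
For anyone who wants to check the arithmetic, here's a back-of-the-envelope sketch using the numbers from the list above. It assumes that "double the amount of sampling" means 16x coverage and that sequencing gets 5 of the 10 days, with the machines running around the clock; the variable and function names are just for illustration.

```python
GENOME_BP = 3_000_000_000           # about 3 billion base pairs per genome
THROUGHPUT_BP_PER_HOUR = 67_000_000 # about 67 Mbp per hour per machine
COVERAGE = 16                       # assumption: 8x oversampling, doubled
GENOMES = 100                       # prize requirement
SEQUENCING_DAYS = 5                 # half of the 10-day limit
MACHINE_COST_USD = 500_000          # half a million dollars apiece


def machines_needed() -> int:
    """Number of machines needed, rounded up, if they run nonstop."""
    total_bp = GENOME_BP * COVERAGE * GENOMES
    bp_per_machine = THROUGHPUT_BP_PER_HOUR * SEQUENCING_DAYS * 24
    return -(-total_bp // bp_per_machine)  # integer ceiling division


machines = machines_needed()
print(machines)                     # 598
print(machines * MACHINE_COST_USD)  # 299000000
```

So roughly 600 machines and about $300 million in hardware alone, before reagents and manpower.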

You'd also need one hell of a computer cluster. The computational power needed to do this kind of project in 5 days is very likely out of the reach of any cluster in existence today.

So yes, this is a substantial challenge by today's standards. Can it be done in 5 years? I'm inclined to say yes. There are several other high-throughput sequencing technologies scheduled to make their debut soon, and current technologies, like pyrosequencing, will only get better. I'd expect that you could see at least a 10-fold improvement in both price and performance within the next two years or so. And even if Moore's Law has been slipping recently, I really don't foresee computational power being the limiting factor here.

So put me on record as betting that yes, someone will claim this prize. Any naysayers want to place a small wager on it?

UPDATE: Thanks to the commenters for pointing a few things out. First of all, it appears that sequencing is the only step required. I couldn't find the details anywhere last night, as the xprize website is pretty poorly designed. Secondly, this has to be accomplished on one machine, which makes the challenge much harder. Now I'm a bit more skeptical that this can be accomplished. I'll retract my bet and stick to "it might be possible".

Comments

Written by Martin Corcoran -

The assembly and annotation of an already completed genome with a well-validated reference sequence (like the human genome) is much easier to cope with than an unknown genome. Anything longer than 40 bases can easily and unambiguously be placed in its correct position in the human genome. Indeed, for most sequences 20-25 bases is all you require. Computing power will not be a problem; it's generating the sequences that will be the holdup. Even now we could probably do it, it's just that it's too expensive for the reagents.

Written by RPM -

I don't think annotation is part of the competition. From the website: "The $10 million X PRIZE for Genomics prize purse will be awarded to the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 10,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 per genome." There's nothing there about annotation, only sequencing and assembly. The assembly should be greatly facilitated as this is a resequencing project, not a de novo project. There are already algorithms in place for assembling genomes on top of a pre-existing scaffold (such as a closely related genome or a genome from the same species). I think the hardest part is that it must be done on a single machine. Do they expect this machine to perform the sequencing reactions (probably pyrosequencing based), the base calls, AND the computational work required to assemble the reads?

Written by Chris -

Interesting - I had assumed that assembly/annotation was part of it. And now that Martin brought it up, I suppose that there is a very good reference genome to align against, which means that annotation is easier (but still not trivial, by any means!). 25 bases is enough to cover most regions, but when you get into long mononucleotide repeats or repeat sequences like Alus, things get a lot more complicated. Thanks for the pointer to the actual announcement. That info was nowhere to be found last night.
