A new X-prize has been announced, and this one is right up my alley. They're offering 10 million dollars (and an enormous amount of publicity) to the first team to decode the genes of 100 people in 10 days or less. Unfortunately, the announcement is fairly unclear as to what they mean by 'decode'. For the purposes of this discussion, I'll assume that it means sequence and assemble the patient's complete DNA and annotate all known genes and variations within.
First, some background: the process of 'decoding' that they describe has three major steps.
Step 1: Sequencing
This involves getting the raw data out of a person's DNA - strings of base pairs that look something like this: AACGTCTAAAGATA. Sequencing is typically done using a technique called shotgun sequencing. Your DNA (all 3 billion base pairs of it) is broken up into small chunks, and the pieces are sequenced. The currently used method, Sanger sequencing, is fairly slow, which is why Baylor has three floors devoted to the genome center - it requires a lot of machines to get decent throughput. This technology has been around for decades, but may not last much longer. The new kid on the block is pyrosequencing, pioneered by 454 Life Sciences. Their process is a lot cheaper (2-10x) and a lot faster (100x), but results in much shorter read lengths.
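To make the shotgun idea concrete, here's a minimal toy sketch in Python. It is not how a real sequencer works: the function name `shotgun_reads`, the 1,000 bp random "genome", and the fixed read length are all assumptions made up for illustration. It just samples reads at random positions until the requested oversampling is reached.

```python
import random

def shotgun_reads(genome, read_len, coverage, seed=0):
    """Toy model (an assumption, not a real protocol): sample fixed-length
    reads at random start positions until total bases = coverage * genome."""
    rng = random.Random(seed)
    n_reads = (coverage * len(genome)) // read_len
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]

# A made-up 1,000 bp "genome" standing in for the real 3 billion.
genome = "".join(random.Random(1).choice("ACGT") for _ in range(1000))

# 8x oversampling with 100 bp reads means 80 reads for 1,000 bp.
reads = shotgun_reads(genome, read_len=100, coverage=8)
print(len(reads), "reads of", len(reads[0]), "bp")
```

Shorter reads at the same oversampling simply mean more reads, which is what makes the next step harder.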
Step 2: Assembly
A convenient way to think about assembly is that it's like doing a giant jigsaw puzzle. After sequencing, what we have is a collection of millions of fragments of DNA. Using complicated computer algorithms, we're able to sort through all these pieces and try to put them back together again. It's a lengthy process, complicated by a variety of things like repeat sequences (identical puzzle pieces). Using pyrosequencing, which gives us much shorter reads, means that we're doing a much more complicated puzzle, because it has many more pieces, and each piece is 7 times smaller.
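For a feel of what the jigsaw looks like in code, here's a rough sketch of a greedy overlap assembler in Python. To be clear, this is a textbook toy, not what production assemblers do; the function names `overlap` and `greedy_assemble` and the three hand-picked reads are my own assumptions. It repeatedly merges the pair of reads with the longest suffix/prefix overlap.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b
    (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until nothing overlaps any more."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps; leave the pieces as separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Three overlapping fragments of the example string AACGTCTAAAGATA.
contigs = greedy_assemble(["AACGTCT", "GTCTAAAG", "AAAGATA"])
print(contigs)  # ['AACGTCTAAAGATA']
```

The pairwise comparison is quadratic in the number of reads, which hints at why more, shorter pieces hurt so much. Repeats are worse still: if two different places in the genome share the same overlap, a greedy merge can happily join the wrong pieces.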
Step 3: Annotation
After getting our sequence assembled, we still have to go back and figure out which parts are genes. Then, we have to take a detailed look at each of these genes to identify mutations. While many mutations are harmless, there are lots of others that cause disease states. For obvious reasons, we'd like to know where these are.
Doing the math
- A human genome is about 3 billion base pairs long.
- Pyrosequencing gives us about 67Mbp per hour.
- With 454, 8x oversampling only gives you 95% coverage of the unique portions of the genome, so I'll guesstimate that we'll have to double the amount of sampling (to 16x) to get completeness near 99.9%.
- I'll assume that you'll need half the time (5 days) to sequence, and the other half to assemble and annotate.
So, in order to meet this goal, using current technology, you'd need something like 600 pyrosequencing machines. Oh, and have I mentioned that they cost a half-million dollars apiece? And that this doesn't include the costs of reagents and manpower??
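Here's the back-of-envelope arithmetic spelled out as a small Python sketch. The inputs are the figures from the list above; the 16x value is my reading of "double the 8x sampling", and the variable names are just mine.

```python
import math

GENOME_BP = 3_000_000_000        # about 3 billion base pairs per person
PEOPLE = 100                     # genomes required by the prize
OVERSAMPLING = 16                # assumption: 8x doubled, as guesstimated above
RATE_BP_PER_HOUR = 67_000_000    # about 67 Mbp per hour per machine
SEQUENCING_HOURS = 5 * 24        # half of the 10 days goes to sequencing
MACHINE_COST_USD = 500_000       # roughly a half-million dollars apiece

total_bp = GENOME_BP * PEOPLE * OVERSAMPLING          # bases to sequence overall
bp_per_machine = RATE_BP_PER_HOUR * SEQUENCING_HOURS  # bases one machine manages in 5 days
machines = math.ceil(total_bp / bp_per_machine)
hardware_cost = machines * MACHINE_COST_USD

print(machines)        # 598
print(hardware_cost)   # 299000000
```

So "something like 600" machines works out to roughly 300 million dollars in hardware alone, before reagents and manpower.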
You'd also need one hell of a computer cluster. The computational power needed to do this kind of project in 5 days is very likely out of the reach of any cluster in existence today.
So yes, this is a substantial challenge by today's standards. Can it be done in 5 years? I'm inclined to say yes. There are several other high-throughput sequencing technologies scheduled to make their debut soon, and current technologies, like pyrosequencing, will only get better. I'd expect that you could see at least a 10-fold improvement in both price and performance within the next two years or so. And even if Moore's Law has been slipping recently, I really don't foresee computational power being the limiting factor here.
So put me on record as betting that yes, someone will claim this prize. Any naysayers want to place a small wager on it?
UPDATE: Thanks to the commenters for pointing a few things out. First of all, it appears that sequencing is the only step required. I couldn't find the details anywhere last night, as the X Prize website is pretty poorly designed. Secondly, this has to be accomplished on one machine, which makes the challenge much harder. Now I'm a bit more skeptical that this can be accomplished. I'll retract my bet and stick to "it might be possible".