Dr. Green is the Director of the NIH’s National Human Genome Research Institute (NHGRI). We caught up with Dr. Green at McGill’s 2011 Human Genetics Graduate Student Research Day, where he gave the keynote presentation.
Scientists often speak of “sequencing the human genome”. Is there really a single genome we can use as reference?
You used the key word ‘reference’: the Human Genome Project (HGP) was said to have sequenced the human genome. Really, the more accurate phrase that should’ve been used was: we created a reference sequence of the human genome. You shouldn’t think of the product of the HGP as the sequence of a human being, because that’s not actually true. In fact, what the HGP produced was a sequence of all human chromosomes—roughly 3 billion letters in total. But any given human being has 6 billion letters: 3 billion from mom, 3 billion from dad.
So we have to distinguish a reference sequence, which is the hypothetical representation of the sequence of each human chromosome, from a personal genome sequence, which is the full representation of the two copies of each of your chromosomes.
Is the reference genome very different from our genomes?
In some ways, it is. A reference sequence differs from the genome you got from your parents at only about 1 in 1,000 bases, which means that the reference sequence matches what you’ll find in the human species about 99.9% of the time. On the one hand, that’s incredibly similar. But on the other hand, the richness of what we want to learn is in that 0.1%; that’s what we’re most interested in if we think about health and disease. Those differences are called genetic variants and they can confer risk for disease or give protective characteristics.
So generating the human genome sequence is the HGP’s attempt at providing a framework—a reference or starting point—for being able to understand sequence differences and correlate those to health, disease, drug response and so forth.
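The 1-in-1,000 figure above can be turned into a rough count of differing positions. The sketch below is back-of-the-envelope arithmetic only: it uses the rounded numbers quoted in the interview and assumes, for simplicity, that differences are spread evenly along the sequence. The names are illustrative.

```python
# Rough arithmetic only; both figures are the rounded ones quoted in the interview.
REFERENCE_LENGTH = 3_000_000_000  # letters in the reference sequence
DIFFERENCE_RATE = 1 / 1000        # about 1 differing base per 1,000

def expected_differences(length: int, rate: float) -> int:
    """Expected number of positions that differ from the reference."""
    return round(length * rate)

differing_positions = expected_differences(REFERENCE_LENGTH, DIFFERENCE_RATE)
print(f"{differing_positions:,} positions differ from the reference")
```

Under these assumptions, the 0.1% corresponds to a few million positions, which is why even a very similar genome still leaves a long list of variants to interpret.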
So we can’t simply compare the reference to someone’s genome and look for differences?
That’s right. You don’t want to ask the question “does my genome differ from the reference?” That’s too simple a question. The question you want to ask is “given the variants at this particular place in the genome, have they ever been seen before?” If so, how often have they been seen?
We now have databases that not only list all the variants that exist but also tell us the frequency with which we see them. So the mere fact that you differ from the reference sequence doesn’t mean anything.
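The two questions above ("has it been seen before?" and "how often?") can be sketched as a lookup against a frequency catalogue. Everything in this example is hypothetical: the variants, their frequencies, the `RARE_THRESHOLD` cutoff, and the function name are made up for illustration and do not come from any real database.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chromosome: str
    position: int
    ref: str  # base in the reference sequence
    alt: str  # base observed in the personal genome

# Hypothetical catalogue: variant -> fraction of sequenced genomes carrying it.
KNOWN_FREQUENCIES = {
    Variant("chr1", 12345, "A", "G"): 0.42,
    Variant("chr2", 67890, "C", "T"): 0.0003,
}

RARE_THRESHOLD = 0.01  # assumed cutoff between "common" and "rare"

def describe(variant: Variant) -> str:
    """Report whether a variant has been seen before and, if so, how often."""
    freq = KNOWN_FREQUENCIES.get(variant)
    if freq is None:
        return "never seen before"
    if freq < RARE_THRESHOLD:
        return f"seen before, rare ({freq:.2%})"
    return f"seen before, common ({freq:.2%})"

for v in [Variant("chr1", 12345, "A", "G"), Variant("chrX", 1, "G", "A")]:
    print(v, "->", describe(v))
```

The point of the sketch is that a difference from the reference is only the starting point; the frequency attached to it is what makes it informative.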
Is all this research happening because it’s becoming cheaper to sequence?
A very important aspect is indeed the cost of sequencing, which is dropping precipitously. The first human genome sequence cost us about 3 billion dollars, the best $3 billion ever spent. Now the cost of sequencing your entire genome is on the order of $10,000, so we’ve gone from billions of dollars to $10,000 in about 8 years. That’s pretty good. But we’re motivated to do this primarily because we know we have to: we can’t just have one human genome sequence; we need a whole lot more.
Can we go down to $100?
We proposed $1,000 in 2003 and we thought we were crazy. We would love it to be cheaper and cheaper but the truth of the matter is that we shouldn’t lose sight of where we are now. The $1,000 is very cheap compared to the cost of understanding what it actually means.
I don’t think that much about the cost of genome sequencing because I think we will eventually coast to the $1,000, or even $100, genome. That’s not where the burden is. Right now, the grand challenge is understanding that sequence: if I handed you your genome sequence, the perfect, complete 6 billion letters, you would have to invest a lot of money to understand it. We don’t yet know how to interpret all that data, so the challenge now really lies not in data generation but in data analysis.
How different are two genomes?
You and I roughly differ by 3 to 5 million single nucleotides, and the great majority of those are completely innocent—they have no phenotypic consequences. A small subset of those do, but we have very little knowledge about how to sift through them. We can make the list of variants but we’re not yet at the point where we can identify which ones we need to focus on and what their effect on human health is. We now need to take these catalogues of variants and start attributing biological and clinical relevance to them—that’s the next decade. Maybe people like you will help us figure it out.
Have we spotted some variants that are responsible for, say, cancer?
Sure, but we’re still at the very tip of the iceberg. Cancer is a great example because there’s a lot of action: it’s very clear that we need to sequence (and are now sequencing) cancer genomes and catalog what we find, and here’s why:
You can take 100 tumor samples and analyze them under the microscope: They’ll all look the same. But when you sequence their genomes, you might see that 50 of them tend to have a certain set of variants (i.e. a signature) and maybe the remaining 50 will have another signature.
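The idea of sorting tumors by signature can be sketched as grouping samples by the set of variants they carry. This is a deliberately simplified illustration: the sample IDs and variant labels are hypothetical, and it groups only samples whose variant sets match exactly, whereas tumors in practice would merely tend to share a set of variants.

```python
from collections import defaultdict

# Hypothetical data: tumor sample ID -> set of variant labels found by sequencing.
TUMOR_VARIANTS = {
    "tumor_01": {"V1", "V2", "V3"},
    "tumor_02": {"V1", "V2", "V3"},
    "tumor_03": {"V7", "V8"},
    "tumor_04": {"V7", "V8"},
}

def group_by_signature(samples):
    """Group sample IDs by the exact set of variants they carry."""
    groups = defaultdict(list)
    for sample_id, variants in samples.items():
        # frozenset is hashable, so it can serve as the signature key.
        groups[frozenset(variants)].append(sample_id)
    return dict(groups)

signature_groups = group_by_signature(TUMOR_VARIANTS)
for signature, sample_ids in signature_groups.items():
    print(sorted(signature), "->", sample_ids)
```

Samples that look identical under the microscope end up in different groups here purely because of what their sequences contain.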
We might even correlate a signature with groups of people that respond poorly to therapy. It would be great if we knew this upfront because it means we wouldn’t have to put poor responders through chemotherapy. Instead, we’d look for more appropriate treatments.