This article discusses the major advances that have taken place in sequencing methodologies - what next-generation sequencing (deep sequencing) is, what it is being used to reveal, and its implications for disease diagnosis and personalised treatment, as well as any limitations of this technology.
by Jeremy Schwartzentruber & Jacek Majewski
Although DNA sequencing has been around for decades, recent dramatic improvements in speed and reductions in the cost of sequencing are in the process of revolutionising medical research [Figure 1].
The first full human genome assembly was published in the year 2000. From around this time up until 2008 the cost of sequencing fell exponentially [Figure 1], keeping pace with Moore’s law – the rule of thumb coined in the microprocessor industry that predicted computing performance would double every two years. This pace of development is already quite an achievement. Then, at the start of 2008, commercial next-generation sequencing (NGS) machines became available and enabled an even more dramatic increase in throughput and a simultaneous reduction in cost. Five years ago reading a megabase of DNA sequence cost over $500; today, it costs less than 10 cents. The resulting massive output of sequence data has enabled new applications across a variety of research fields. Anthropologists are learning about human population history and diversity from extensive sequence data on existing populations, and also from the ancient DNA of extinct hominids. Research in the genetics of model organisms such as the mouse and the fruit fly is greatly accelerated by affordable genome sequencing. In the field of metagenomics, the deep sequencing enabled by NGS is providing detail on the diverse microbial communities that inhabit the human skin, mouth and gut. And not surprisingly, medical research is in the process of being transformed by NGS technologies. Along with these new opportunities, the flood of data is creating an acute need for improved information technology (IT) infrastructure, software analysis tools, and most critically, scientists and professionals with the skills to manage, analyse and interpret the data.
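To make the comparison with Moore's law concrete, the short sketch below works through the arithmetic using only the figures quoted above (over $500 per megabase five years ago, under 10 cents today, and a two-year doubling period). The function names are illustrative, and because both prices are bounds, the computed fold change is a lower bound rather than a measured value.

```python
import math


def moore_fold(years, doubling_period_years=2.0):
    """Fold improvement expected if performance doubles every doubling period."""
    return 2.0 ** (years / doubling_period_years)


def implied_halving_time(years, fold_change):
    """Halving time (in years) that would produce fold_change over the span."""
    return years / math.log2(fold_change)


YEARS = 5
COST_THEN = 500.0  # USD per megabase; the article says "over $500"
COST_NOW = 0.10    # USD per megabase; the article says "less than 10 cents"

observed_fold = COST_THEN / COST_NOW

print(f"Moore's law over {YEARS} years: about {moore_fold(YEARS):.1f}-fold")
print(f"Sequencing cost per megabase: at least {observed_fold:,.0f}-fold")
halving_months = implied_halving_time(YEARS, observed_fold) * 12
print(f"Implied cost halving time: {halving_months:.1f} months or less")
```

On these figures, a Moore's-law pace would have delivered a fall of less than sixfold over five years, whereas the quoted prices imply a fall of at least 5,000-fold.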
The technological changes that triggered this breakthrough are similar to those in the microprocessor industry decades earlier: miniaturisation and massive parallelisation. While many different sequencing technologies are now available, all NGS methods involve reading the sequences of millions of DNA fragments in parallel. This DNA can be either genomic DNA or complementary DNA produced from cells’ RNA; sequencing the latter is called RNA-seq. NGS technologies can be broadly divided into "second-generation" methods that operate on clonally amplified DNA fragments and "third-generation" single-molecule techniques. The most commonly used sequencers, such as Illumina HiSeq, Roche 454 and ABI SOLiD, use clonally amplified DNA; that is, each "read" of DNA sequence comes from a single cluster of identical molecules. The length of a sequence read typically ranges from 50 to 500 base pairs. The Illumina HiSeq currently has the highest throughput, at up to 600 gigabases of sequence per 11-day run, and also the lowest cost per megabase (Mb) of DNA sequenced, at around $0.03 per Mb. To put this in context, the amount of raw data generated is enough to sequence three to four whole human genomes so that each of the three billion base pairs of the genome would have 30 independent DNA sequence reads covering it. This excess of data is needed because the coverage varies slightly across the genome, and because heterozygous genotypes can only be accurately determined when enough reads support each allele. It should be noted that technology improvements are occurring multiple times per year for each of the leading sequencing systems, so numbers for sequencing throughput and cost quickly become out of date.
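The coverage reasoning above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any real genotype caller: the function names are made up here, the `usable_fraction` parameter is hypothetical (the article gives no such figure), each read at a heterozygous site is assumed to sample either allele with probability 0.5, and the depth is treated as fixed rather than varying across the genome.

```python
from math import comb


def genomes_per_run(run_bases, genome_size, target_depth, usable_fraction=1.0):
    """Number of genomes one run can cover at target_depth.

    usable_fraction is a hypothetical knob for the share of raw bases that
    end up as usable coverage; 1.0 gives the naive upper bound.
    """
    return run_bases * usable_fraction / (genome_size * target_depth)


def prob_both_alleles_supported(depth, min_reads):
    """Probability that each allele at a heterozygous site is seen in at
    least min_reads of depth reads (binomial model with p = 0.5)."""
    if depth < 2 * min_reads:
        return 0.0
    favourable = sum(
        comb(depth, k) for k in range(min_reads, depth - min_reads + 1)
    )
    return favourable / 2 ** depth


# Figures from the text: 600 gigabases per run, 3 billion bp genome, 30x depth.
upper_bound = genomes_per_run(600e9, 3e9, 30)
print(f"Naive upper bound: {upper_bound:.1f} genomes per run")

# How the chance of seeing both alleles at least 3 times grows with depth.
for depth in (6, 10, 20, 30):
    p = prob_both_alleles_supported(depth, min_reads=3)
    print(f"depth {depth:>2}: P(both alleles seen >= 3 times) = {p:.4f}")
```

The naive bound comes out higher than the three to four genomes quoted above, which would be consistent with only part of the raw output contributing usable coverage, although that interpretation is an assumption. The second function shows why a depth of around 30 is targeted: at low depth there is a substantial chance that one allele of a heterozygous site is under-sampled.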
Single-molecule systems, such as the Ion Torrent sequencer from Life Technologies and the Pacific Biosciences RS, became available in 2010 and 2011, respectively. They have advantages such as generating read lengths up to a few thousand base pairs, which aid greatly in assembling new genomes without an existing reference sequence, and their run times can be as short as an hour. Although these single-molecule systems have not yet reached the throughput or cost efficiency of second-generation machines, they appear to have greater potential for future improvement. Numerous other single-molecule sequencing technologies are on the horizon, and promise to increase the throughput and decrease the cost of sequencing by orders of magnitude. A new entrant to the field, Oxford Nanopore, claims that they will soon be able to sequence a whole human genome in just fifteen minutes.
Impact on disease research
The cost of sequencing a human genome is now in the order of $5,000, approaching the goal of a "thousand-dollar genome", at which point many believe that personal genome sequencing will become ubiquitous. While we have not quite reached the thousand-dollar genome, we are at the thousand-dollar exome already. Exome sequencing involves capturing the 1.5% to 2% of the genome that codes for proteins and washing away the rest of the DNA. Due to the capture and washing steps, exome sequencing can actually require slightly more input DNA than whole-genome sequencing, though both are in the range of 1 to 5 µg. For many applications this amount of DNA is available, as sufficient quantities can be extracted from the white blood cells in a standard blood sample. However, in some cases, such as when extracting DNA from a surgically removed tumour, only a limited amount of DNA is available.
In addition to costing an order of magnitude less than whole-genome sequencing, exome sequencing has the advantage that it focuses on the portion of the genome that we have tools to interpret. The value of this should not be underestimated. An article in the New England Journal of Medicine stated in 2010, "Physicians are still a long way from submitting their patients’ full genomes for sequencing, not because the price is high, but because the data are difficult to interpret" [2]. Two years later, we are scarcely closer to knowing how to interpret non-coding regions of the genome. On the other hand, each week multiple studies are published that use exome sequencing to reveal the genetic basis of both rare and common diseases.
To study a rare disease, a geneticist used to need a large family with multiple affected relatives to apply linkage mapping – an analysis that narrows down the genomic region of interest based on regions shared by affected relatives. This analysis would be followed by selecting individual genes in the region for sequencing using older "gold standard" (but low-throughput) sequencing technologies. With exome sequencing, one sequences all genes and screens for rare genetic variants – single-nucleotide variants (SNVs) and insertions or deletions (indels) – that seem likely to disrupt gene function. In contrast with linkage mapping, exome sequencing can often determine the genetic variant causing a rare disease in a small family with as few as one or two affected individuals. Doing so, however, requires distinguishing the disease-causing variant from among the thousands of variants that are found in any individual’s exome. Jay Shendure, a leader in NGS-based disease analysis, refers to this as finding "needles in stacks of needles" [3].
While there are software tools that attempt to predict whether a mutation is likely to be damaging, these tools are far from perfect. Indeed, the most useful "tools" in genetic disease research are the large public databases of genetic variation, such as the 1000 Genomes Project (www.1000genomes.org), which aims to sequence about 2500 whole genomes from healthy individuals, and the NHLBI Exome Sequencing Project (evs.gs.washington.edu/EVS), which has already sequenced over 7000 exomes of people with heart, lung and blood disorders. Using these databases, one can filter out DNA variants that are too common to be a plausible cause of the disease of interest.
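The filtering logic described in the last two paragraphs can be sketched as follows. This is a toy illustration, not an actual analysis pipeline: the `Variant` record, the gene names, the frequencies and the 1% threshold are all hypothetical, and in practice the population frequency would be looked up in a database such as those named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    position: int
    alt: str
    population_frequency: float  # 0.0 if never seen in the reference database
    predicted_damaging: bool     # output of an (imperfect) prediction tool


def candidate_variants(exomes, max_frequency=0.01):
    """Return variants shared by every affected individual that are both
    rare in the general population and predicted to disrupt gene function."""
    if not exomes:
        return set()
    shared = set.intersection(*(set(exome) for exome in exomes))
    return {
        v for v in shared
        if v.population_frequency <= max_frequency and v.predicted_damaging
    }


# Made-up variants for two affected relatives (gene names are placeholders).
common = Variant("GENE_A", 1201, "T", 0.23, True)     # too common to be causal
silent = Variant("GENE_B", 88, "G", 0.0, False)       # rare but likely benign
private = Variant("GENE_C", 542, "A", 0.0, True)      # only in one relative
candidate = Variant("GENE_D", 3017, "C", 0.0005, True)

relative_1 = {common, silent, private, candidate}
relative_2 = {common, silent, candidate}

for v in candidate_variants([relative_1, relative_2]):
    print(v.gene, v.position, v.alt)
```

Each filter removes a different class of "needle": the intersection drops variants private to one relative, the frequency cut-off drops common polymorphisms, and the damage prediction drops changes unlikely to affect the protein.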
One of the most promising uses of NGS is to study common diseases such as cancer. While all cancers are a result of genetic mutations, cancers arising in different organs and tissues of the body often have distinct genetic causes. Our own exome sequencing results recently demonstrated that a specific childhood brain cancer, glioblastoma, is frequently caused by mutation of a specific histone protein at the exact same position in different patients [4]. This scenario is especially surprising because in general, mutations at many different positions in a gene could disrupt its function. Indeed, the genetic heterogeneity of cancer presents a major challenge to developing effective therapies, as a given therapy may succeed or fail depending on the particular mutations in an individual tumour. NGS may soon prove useful in cancer diagnosis and therapy, as determining the genetic subtype of a tumour could guide individualised treatment.
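One simple way to spot the kind of recurrence described above is to count how many patients carry a mutation at each protein position. The sketch below uses invented patient identifiers, placeholder gene names and made-up positions; it illustrates the counting idea only and does not reproduce the analysis in the cited study.

```python
from collections import Counter


def recurrent_sites(tumour_mutations, min_patients=2):
    """Map each (gene, protein_position) site to the number of patients
    carrying a mutation there, keeping sites seen in >= min_patients."""
    counts = Counter()
    for sites in tumour_mutations.values():
        counts.update(set(sites))  # count each site at most once per patient
    return {site: n for site, n in counts.items() if n >= min_patients}


# Hypothetical somatic mutations per tumour: (gene, amino-acid position).
tumours = {
    "patient_1": [("GENE_X", 27), ("GENE_Y", 310)],
    "patient_2": [("GENE_X", 27), ("GENE_Z", 45)],
    "patient_3": [("GENE_X", 27), ("GENE_Y", 512), ("GENE_Y", 512)],
    "patient_4": [("GENE_Z", 901)],
}

for (gene, position), n in sorted(recurrent_sites(tumours).items()):
    print(f"{gene} position {position}: mutated in {n} patients")
```

In this toy data only one site recurs, while the other gene with several mutations is hit at a different position in each patient, which is the more usual pattern when a gene is simply being inactivated.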
You might wonder when genome sequencing will move beyond the lab and into the clinic. A number of companies currently provide genome-wide tests to consumers that can determine an individual’s risk for a variety of diseases. However, so far NGS has been too costly for consumers. Instead, these companies’ tests rely on microarrays, which only determine a person’s genotype at specific markers across the genome that may be associated with specific diseases. Such an approach misses most of the variation in an individual’s genome, and in particular will miss nearly all rare variants, which are increasingly believed to be responsible for a large fraction of both rare and common diseases.
Recently, the company 23andMe started to offer the service of sequencing a personal exome for $999. So far, access to the service is limited to "customers who are comfortable managing and understanding raw genetic data." Although a raw genome sequence is not yet useful to most consumers, it will one day be useful to medical doctors. The greatest steps towards personalised genomics and therapeutics have so far come from large research programmes, such as the ClinSeq project by the National Human Genome Research Institute [5]. This project enrolled more than 500 participants both with and without a family history of cardiovascular disease, and for each individual it conducted a clinical evaluation and exome sequencing, and provided genetic counselling for any results deemed medically important to the study participants.
It is important to realise that while genome sequencing may discover mutations that relate to the original medical issue or diagnosis, it is highly likely to reveal "incidental findings" – genetic variants in an individual unrelated to the original diagnosis but which may be medically relevant and actionable. Such findings could include the likelihood of late-onset genetic disease (e.g. Huntington’s disease), increased risk for disease (e.g. breast cancer susceptibility), or information that the individual carries a recessive allele that could lead to disease in his/her children. Properly informing people about genetic disease is already challenging, but even more so when there are multiple results to explain and when findings are unrelated to the original diagnosis. Experience from the ClinSeq project shows that participants usually want to receive "all" of the available results from their genome sequencing, but when their results are explained they are often overwhelmed and quickly reach a point of informational saturation [6].
Hospitals are just now beginning to enter the game. Children’s Hospital Boston, USA has initiated the CLARITY challenge (Children’s Leadership Award for the Reliable Interpretation and appropriate Transmission of Your genomic information), in which selected research teams from around the world will compete "to identify best methods and practices for improving the reliability and accuracy of the genomics-to-clinic pipeline – spanning sequencing, analysis, interpretation and reporting – to provide the most meaningful results to patients and their families" [7]. Initiatives like this are still in preliminary stages, but are an important step towards the goal of advancing patient treatment with appropriate use of NGS technologies.
The use of NGS in clinical settings is sure to increase in the future, driven both by the falling cost of sequencing and the potential for improved treatment in specific cases. Having detailed individual genetic information could also enable preventive medicine based on individual disease risk. However, there remain a number of challenges, both technical and practical, to such use of NGS. Technically, even "whole genome sequencing" does not currently determine an individual’s full genome sequence, but closer to 90% of it. A variety of factors account for this, including difficulties sequencing certain regions, difficulties aligning repetitive regions to the reference human genome, and limitations of current technologies for determining structural variations (rearrangements, inversions) and copy number variations (large insertions or deletions, and repetitive sequences). Practically, returning information to individuals about their genome sequence variants could consume large amounts of clinical resources. Each individual has thousands of variants just in protein-coding regions of the genome, and the great majority of these will be of unknown significance. Perhaps even more challenging, in databases that associate genetic variants with specific diseases, 10% or more of the entries may be incorrect, listing a variant that is in fact benign as associated with disease. To use NGS data in routine clinical evaluations will require more comprehensive and validated databases of disease-associated genetic variants, improved analysis tools to produce error-free lists of variants and determine the likely significance of each, and new approaches to deliver information to patients about potential disease-causing variants in their genes. While the cost and speed of DNA sequencing are nearing the point where its clinical use could become widespread, making good use of the generated data will become the greater challenge.
1. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. Available at: www.genome.gov/sequencingcosts. Accessed 2012-03-26.
2. Varmus H. Ten Years On — The Human Genome and Medicine. N Engl J Med 2010; 362: 2028-2029.
3. Cooper GM and Shendure J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet 2011; 12: 628-640.
4. Schwartzentruber J et al. Driver mutations in histone H3.3 and chromatin remodelling genes in paediatric glioblastoma. Nature 2012; 482: 226–231.
5. ClinSeq, www.genome.gov/20519355.
6. Biesecker LG. Opportunities and challenges for the integration of massively parallel genomic sequencing into clinical practice: lessons from the ClinSeq project. Genet Med Advance online publication 16 Feb. 2012.
7. The Clarity Challenge: genes.childrenshospital.org.
McGill University and Genome Quebec Innovation Centre
740 Docteur Penfield Ave
Montreal QC, H3A 1A4
Department of Human Genetics
H3A 1A4, Montréal