Today’s blog post comes from guest writer Stanislav Volik, who has worked in genomics since the 1990s. His PhD thesis was one of the first genomics theses defended in Russia. His genomics focus has been on cancer studies, specifically breast and prostate cancer. With two colleagues, he invented and patented a paired-end sequencing approach for deciphering the structure of tumour genomes in the early 2000s, before NGS made it feasible to sequence tumour DNA directly.
A look back at the history of the Human Genome Project
With the recent release of a complete human genome by the Telomere-to-Telomere consortium, I found myself reflecting on the history of our collective efforts to achieve a better understanding of our genetic heritage. We could say that this year marks the coming of age for the Human Genome Project (HGP). Twenty-one years ago, the first drafts of the human genome sequence were published by the public, National Institutes of Health-led International Human Genome Consortium and by Celera Genomics, a commercial entity founded by Craig Venter. The “First Draft”, of course, was exactly that – only about 90% of the euchromatic (generally gene-rich) regions were analyzed. This prompted a string of follow-up press conferences and articles describing ever more complete versions of the whole genome sequence, until about three years later, on October 21, 2004, when the International Human Genome Sequencing Consortium published the penultimate paper, titled “Finishing the euchromatic sequence of the human genome”. By any measure, this is one of the most towering scientific and technological achievements of the late 20th century. One of the most interesting aspects of its completion is the way in which available technology shaped the strategy and even the politics around this monumental endeavour.
Alta, Utah – The birthplace of the Human Genome Project [Image Source]
Where it all began
The timeline of the HGP is still available on the Oak Ridge National Laboratory website archive. Even in its current, barely functioning form, it reveals a fascinating story of an idea that seemed impossible when, in 1984, a group of 19 scientists found themselves snowed in at a ski resort in Alta, Utah. They grappled with the problem of identifying DNA mutations in survivors of the Hiroshima and Nagasaki nuclear attacks and their children. Existing methods could not identify the then-expected number of mutations, but the advent of molecular cloning, pulsed-field gel electrophoresis, and other wonders of technology gave everybody the feeling that a solution was possible. Charles DeLisi, the newly appointed director of the Office of Health and Environmental Research at the Department of Energy (DOE), read a draft of the Alta report in October 1985, and while reading it first had the idea of a dedicated human genome project. The next year, the Human Genome Initiative was proposed by the DOE after a feasibility workshop in Santa Fe, New Mexico. In 1987, it was endorsed and the first budget estimate appeared. Finally, in 1990, the National Institutes of Health (NIH) and the DOE announced the first five-year plan, titled “Understanding Our Genetic Inheritance: The US Human Genome Project”. The project was announced with an approximate annual budget of $200M and a stated goal of completing the sequencing of the first human genome in 15 years (a total of $3B in 1990 dollars, equivalent to approximately $6B today).
1980 Nobel laureates P. Berg, W. Gilbert and F. Sanger (left to right)
The Maxam, Gilbert, and Sanger race
In 1985, the concept of sequencing the whole human genome was truly revolutionary scientific thinking at its best, since no appropriate technology was ready for such a task. Only four years had passed since the 1980 Nobel Prize in Chemistry was shared between P. Berg for his “fundamental studies of the biochemistry of nucleic acids, with particular regard to recombinant-DNA” and W. Gilbert and F. Sanger (the second Nobel Prize for the latter) for “their contributions concerning the determination of base sequences in nucleic acids”. But it was not quite clear which of Gilbert’s or Sanger’s approaches to sequencing would prove more efficient. Maxam and Gilbert had developed a purely chemical method of sequencing nucleic acids that required many chemical steps but could be performed on double-stranded DNA. Sanger’s approach, on the other hand, required single-stranded DNA. In the early and mid-1980s, both methods were still widely used, and the advantages of Sanger’s approach (its reliability, given access to high-quality enzymes and nucleotides, and its longer reads) were just being established. Both approaches had limited read length (approximately 200-250 bases for Maxam-Gilbert and 350-500 bases for Sanger) and required the genomic DNA to be fragmented prior to analysis. Given the realities of fully manual slab gel sequencing, this meant that determining the sequence of a single average human mRNA was an achievement worthy of publication in a fairly high impact journal. With an average time for analysis of a ready-to-sequence DNA fragment of ~6 hours, an average read length of 350-500 bases, and 10-20 DNA fragments analyzed per slab gel, the throughput for a qualified post-doc at that time reached a whopping 1.7-2.0 kb per hour.
With a haploid human genome size of ~3 billion bases, one was looking at a bare minimum of 171 years of round-the-clock work for a single station to sequence perfectly ordered, minimally overlapping fragments that could then be assembled into the final reference sequence.
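As a back-of-envelope check, the 171-year figure follows directly from the throughput numbers above, assuming the optimistic 2.0 kb/hour rate and non-stop operation:

```python
# Sanity check of the ~171-year estimate, assuming the optimistic
# end of the throughput figures quoted above and 24/7 operation.
genome_bases = 3_000_000_000      # haploid human genome, ~3 billion bases
bases_per_hour = 2_000            # upper bound for one manual sequencing station

hours = genome_bases / bases_per_hour    # 1,500,000 hours of gel runs
years = hours / (24 * 365.25)            # convert to calendar years

print(f"{years:.0f} years")              # ≈ 171
```

At the pessimistic 1.7 kb/hour end, the same arithmetic gives roughly 200 years, so the estimate above is, if anything, generous.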
Mapping it out
There was one caveat – this set of minimally overlapping genomic DNA fragments did not exist yet. It was not immediately clear whether anybody would be able to create one, or how to order the fragments into a full sequence, given that the human genome contains numerous highly repetitive sequences longer than the average read length of existing technologies. It became apparent that an absolute prerequisite for achieving the stated goal of creating a human genome reference sequence was a physical map of the genome, containing information on the order and physical spacing of genomic features that could be identified in sequenceable fragments. This would allow the ordering of the multitude of reads necessary to determine the human genome sequence. Consequently, much effort was spent by the broad scientific community over the course of the next 14 or so years (counting from the fateful Alta meeting) on developing ever more detailed human genome physical maps and ever more complete libraries of ever larger DNA fragments (clones), which were produced and mapped back to the genome using increasingly sophisticated molecular biology techniques. This work was very much supported by the scientific community, not only because it was deemed absolutely necessary for the success of the project, but also because it was “fair”, allowing even relatively small groups to contribute meaningfully to the success of this huge endeavour.
Sanger wins and gets automated
In parallel with the massive efforts aimed at creating a comprehensive physical map of the human genome, a lot of effort was focused on streamlining and then automating DNA sequencing in order to drastically increase sequencing throughput. Sanger sequencing won this battle since it proved easier to automate – no complicated chemical reactions were required – and, as an additional bonus, it offered longer read lengths. But the most important factor was that the biological machinery of DNA synthesis used by this technology proved sufficiently robust and versatile to allow labeling the nucleotides first with biotin and later with fluorescent dyes, obviating the need for radioactive labeling. In 1984, Fritz Pohl reported the first method for non-radioactive colorimetric DNA sequencing. In 1986, Leroy Hood’s group published a method for automated fluorescence-based DNA sequence analysis, a technology that allowed Applied Biosystems to offer the first automated DNA sequencers (ABI 370/373), machines that enabled the first massive sequencing projects, such as the effort to catalog all expressed human genes using “Expressed Sequence Tags” (ESTs). In 1995, another breakthrough instrument was released (the ABI Prism 310) that did away with the pesky problem of pouring flawless, large, thin (down to 0.4 mm thick) gels, which greatly simplified and sped up the sequencing process. Finally, in 1997, the ABI 3700 capillary sequencer was released, boasting 96 capillaries, a configuration that “gives the 3700 system the capacity to analyze simultaneously 96 samples as often as 16 times per day for a total of 16 × 96 = 1,536 samples a day”, as the ABI brochure touted. In other words, users could expect a whopping 768 kb of sequence daily.
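The brochure numbers are easy to verify, assuming the ~500-base Sanger read length mentioned earlier:

```python
# Sanity check of the ABI 3700 daily throughput quoted in the brochure.
# Assumes each sample yields one ~500-base Sanger read.
capillaries = 96
runs_per_day = 16
read_length = 500                               # bases per sample, upper bound

samples_per_day = capillaries * runs_per_day    # 1,536 samples
bases_per_day = samples_per_day * read_length   # 768,000 bases

print(samples_per_day, bases_per_day // 1000)   # 1536 768 (i.e. 768 kb/day)
```

Set against the ~2 kb/hour of a manual slab-gel station, a single such instrument delivered roughly a sixteen-fold increase in daily output – the jump that made the strategy shift described next thinkable.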
Venter causes outrage
This unprecedented increase in sequencing capacity suddenly made another approach feasible – de novo sequencing of complex genomes without the construction of ordered genomic fragment libraries and without the long and very expensive process of physical mapping, an approach that came to be known as “shotgun” sequencing. The theoretical feasibility of such an approach was established in 1995 by Leroy Hood’s team. In a paper titled “Pairwise End Sequencing: A Unified Approach to Genomic Mapping and Sequencing”, they demonstrated that a large, complex genome could be sequenced using just a collection of randomly cloned fragments of at least two very different sizes, which would be randomly subcloned, sequenced, and ordered based on the identification of these paired end sequences in the contigs assembled from the subclones. A mere two years later, in 1997, Craig Venter, the founder of The Institute for Genomic Research and later of Celera Genomics, announced that his team would “single-handedly sequence the human genome” in just three years for $300M, or 1/10th of the originally estimated cost of the public International Human Genome Project.
Needless to say, Venter’s announcement caused an uproar in the genomics community. First, it appeared to render obsolete all the huge efforts spent on physical map construction and on ordering clone libraries. Second, it put the leaders and political supporters of the public HGP in a really bad light: after spending 10x Venter’s budget and working on the project for seven years since its official launch in 1990, their proposed timeline for releasing the draft sequence was still seven years away (2005). And, finally, the scientific community was outraged by Venter’s plans to offer paid access to the genomic sequence to commercial entities. I still remember the charged atmosphere at the Cold Spring Harbor meeting in 1997 when Venter made his announcement. Nobody knew the details (there was no internet as we know it today), only rumors about closed-door talks between the NIH and the Wellcome Trust. It was very late that day, around 10 pm, when Craig Venter came to the podium to present his idea. He was essentially booed off the stage by the outraged audience. Francis Collins, then director of the NIH’s National Human Genome Research Institute, and the then-head of the Wellcome Trust came to the podium and proclaimed that the public HGP would not be beaten; that the Wellcome Trust would devote whatever resources were needed to ensure the “competitiveness” of the public HGP, and that everybody would have free and unfettered access to its results.
Craig Venter (left) and Francis Collins (right) with former US President Bill Clinton to announce the first map of the Human Genome Project [Image Source]
Be that as it may, Venter’s initiative did result in a substantial reevaluation of the HGP strategy. In the end, both teams (Venter’s and the HGP’s) used a hybrid of shotgun data and physical mapping information for the first human genome assemblies, resulting in the groundbreaking simultaneous 2001 publications. And the animosity towards Craig Venter didn’t last long in the genomics community – a few years later, many of the people who had booed in 1997 were applauding his talk on the first large-scale metagenomic project, delivered to the same audience.
Looking back over the many years of my professional life, witnessing the completion of the first HGP was surely the experience of a lifetime. Essentially, the HGP set a new paradigm in biological studies, serving as a prime catalyst for the development of revolutionary new technologies that became tectonic forces in their own right, rendering some massive efforts obsolete yet opening many new paths. This pattern continued with the next challenge: understanding the actual genetic diversity of humans and how we can use this knowledge to meaningfully impact our lives. This could not be accomplished using the first-generation sequencing technologies that enabled the HGP’s success. The next phase of breakthroughs followed, leading to the emergence of next-generation sequencing (NGS) technologies, which finally made it routine not only to sequence individual genomes but also to study the genomes and transcriptomes of single cells. Stay tuned for our next blog post, where we dive deeper into the next phase of technology development – NGS.