DNA Sequencing

Frederick-Sanger

Frederick Sanger. Wikipedia

Frederick Sanger was a British biochemist. He is known for the first sequencing of a protein (1955) and a method for sequencing DNA that bears his name, the Sanger Method. Sanger won two Nobel Prizes in Chemistry [1].

In this post we’ll talk about one of the first steps of DNA analysis, DNA sequencing, which is obtaining the data from the DNA, how it’s performed (we’ll focus on the Sanger method) and some interesting computational problems associated with it.

This is the second post in the series of Biochemistry studies from a Computer Scientist perspective. Our first post is a brief discussion of basic concepts in Cell Biology.

DNA Sequencing

DNA sequencing is the determination of the physical order of the nucleotide bases in a molecule of DNA.

The first living organism to have its genome sequenced was a bacteria, Haemophilus influenzae, whose genome is about 1.8 million base pairs.

Genome represents the whole set of  genetic information of an organism. For a bacteria, it’s singular circular chromosome, but for humans it’s the set of all 23 pairs of chromosomes. A base-pair (shortened as bp) refers to a pair of nucleotides (bases) that are bound together in the double-strands of DNA.

In the human genome, there are 3 billion of base-pairs, and it took 13 years for it to be completed.

There are two main methods of sequencing, Sanger and Next-generation [5]. We’ll talk about the Sanger in details and discuss briefly the Next-generation from a real-world use case.

Sanger Sequencing

The Sanger method is able to determine the nucleotide sequence of small fragments (up to abound 900bps) [5] of DNA.

Overview

The first step is cloning the fragment into multiple copies (like billions) by a process called amplification. This is essentially mimicking the DNA cloning process in an artificial setup. In very high level we have:

  • Separate the double strands (denaturation)
  • Add a special molecule (primer) to the extremity of each strand
  • Extend the primer with nucleotides until it reaches the other extremity

Once this is complete, we end up with two double-stranded fragments. We can repeat the process to generate many copies of the original fragment (theoretically doubling at each step). This process is known as Polymerase Chain Reaction (PCR).

Once we have enough copies of the fragment, we do a similar process, but this time, we also allow extending the primer with a special nucleotide named (dideoxy nucleotide). The key difference is that once it’s added, a dideoxy nucleotide cannot be further extended, and it also contains a special marker that causes each different base to have a different color. The process is now the following:

  • Separate the double strands (denaturation)
  • Add a special molecule (primer) to the extremity of each strand
  • Add to the primer either
    • Regular nucleotide (non-terminal – can be further extended)
    • Dideoxy nucleotide (terminal – cannot be further extended)

Now, instead of ending up with more clones, we’ll have fragments with a variety of lengths.

We then run theses fragments through a process called Capillary Gel Electrophoresis which roughly consists in subjecting the fragments to a electric field, which then causes fragments to move with speed proportional to their length (the smaller the fragment the faster it moves). Once a group of fragments (which have the same length) reach a sensor at end of the field, we make use of the color marker in the special nucleotide at the tip of the fragment to determine the base of that dideoxy nucleotide. This enables us to determine the sequence of the nucleotides in the original fragment!

To given an example, say that the original sequence is GATTCAGC. Because there are many copies of the fragment after amplification, it’s very likely that we’ll end up with fragments with all possible lengths, from 1 to 8. Since a fragment of length 1 moves faster than any other, it will reach the sensor first. We know that a fragment with length 1 is a base pair G-C, and C is a dideoxy nucleotide (if it was not, it would have continued extended further). Say the color marker for C is orange. Then the first color to be captured by the sensor will beorange. The second set of fragments to reach the sensor is of length 2, which is a fragment (G-C, A-T), where T is a dideoxy nucleotide. If we assume it has color red, that’s the color which will be captured next by the sensor. You can see that based on the colors captured by the sensor we can infer the nucleotide sequence in the original segment.

Screen Shot 2018-08-28 at 8.21.24 PM

Fragments with different lengths and color markers. Image copied from Whole-Genome Sequencing

Details

Let’s go in more details for these processes. For both cases we work with solutions. Finer grained manipulation of molecules is infeasible.

To separate the double strands we heat the solution up to 96ºC, in a process we call denaturation. The high temperature causes the hydrogen bonds between pairs of nucleotides to break.

In the same solution we have the primer molecules (aka oligonucleotides) which are carefully chosen to match the beginning of the DNA strand. They also bind to DNA at a higher temperature than the strands (e.g. 55ºC). This is important because we can now lower the temperature of the solution slowly, enough so that primers can bind, but not so low to the point where the original strands will join back together. This slow cooling is called annealing. These primers can be synthesized artificially.

Gap in understanding: how to choose the right primer, since we need to know at least some of sequence from the nucleotide in order to know which primer will end up binding there? One possible explanation is that we know the exact sequence where a DNA was cut if we restriction enzymes, since their binding site is known [9] and hence we have an idea of the result endpoints after the cut.

We also add an enzyme, DNA polymerase, and nucleotides to the solution. The enzyme is able to extend the original primer segment by binding free nucleotides in the solution. In particular we use the enzyme of a bacteria that lives at 70ºC (Thermus Acquaticus), also know as Taq polymerase because it is functional at higher temperatures. Performing the replication at a higher temperature prevents the separated strands from gluing together again.

The dideoxy nucleotide are modified versions of the regular nucleotide by supressing the OH group. They can still be incorporated to the primer via the DNA polymerase, but they prevent other nucleotides to binding to them via the sugar-phosphate binding.

In addition these dideoxy nucleotides contain a fluorescent molecule whose color is unique for each different type of base. How are these molecules “inserted” into the nucleotides? The abstract of [6] states:

Avian myeloblastosis virus reverse transcriptase is used in a modified dideoxy DNA sequencing protocol to produce a complete set of fluorescence-tagged fragments in one reaction mixture.

which suggests it’s possible to synthesize them by using a specific virus.

Screen Shot 2018-08-28 at 8.44.50 PM.png

Dideoxynucleotide vs deoxynucleotide. The lack of the OH group prevents a ddNTP from binding to the “next” nucleotide, effectively terminating the chain. Image copied from Whole-Genome Sequencing

Next-Generation Sequencing (Illumina)

The Sanger method is a very slow process, making it infeasible for analyzing large amounts of DNAs such as the human genome.

Modern sequencers make use of the “Next-generation” methods, which consist in massive parallelism to speed up the process. The most advanced sequencer in the market is produced by a company called Illumina. As of the time of this writing, their top equipment, Hiseq X Ten, costs about $10 million dollars, can sequence about 18k full genomes a year and it costs about $1000 per genome [2, 3].

Illumina’s educational video [4], describes the process:

  • Cut a long sequence into small fragments
  • Augment segments with metadata (indices) and adapters
  • Have the segments attach to beads in a glass plate via the adapters. The beads are basically the primers.
  • Amplify the segments (massive cloning)
  • Extend the primer with fluorescent nucleotides
    • Every time a nucleotide is added, it emits light which is captured by a sensor

After the process is over, we have the sequence of many fragments based on the colors that were emitted.

We can see that the principles resemble the Sanger method, but it uses different technologies to allow a very automated and parallel procedure.

This whole process is very vague and it’s hard to have a good understanding of it. It’s understandable given that a lot of the information is likely industry secret.

Sequencing vs Genotyping in personal genomics

Some of the most popular personal genetic analysis companies, such as 23andMe, provide a service in which they analyze the user DNA for a small fee. It’s way cheaper than the full genome analysis provided by Illumina, but that’s because these companies don’t do DNA sequencing, but rather genotyping.

Genotyping is the process of determining which genetic variants an individual possesses. This is easier than sequencing because a lot of known diseases and traits can be traced back to specific regions and specific chromosomes.

This is the information you most likely want to know about yourself. Remember that the majority of DNA in complex organisms is not useful (introns). In humans genome, exome (the part of DNA consisting of exons) account for less than 2% of total DNA.

Sequencing technology has not yet progressed to the point where it is feasible to sequence an entire person’s genome quickly and cheaply enough to keep costs down for consumers. It took the Human Genome Project, a consortium of multiple research labs, over 10 years to sequence the whole genomes of just a few individuals.

Sequence Assembly Problem

Current technology is unable to sequence large segments of DNA, so it has to break it down into small fragments. Once that is done, we need to reconstruct the original sequence from the data of individual fragments.

There are two scenarios:

  • Mapping: matching fragments of DNA to existing sequences by some similarity criteria
  • De-novo: reconstruct the DNA when there’s no prior data about that sequence.

There is a family of computational methods for the mapping case known as Sequence Assembly, which we’ll study in more details in a future post.

Gap in understanding: It’s unclear what exact information is expected for the De-novo sequencing. It is impossible to determine the original sequence by only having data about individual segments, in the same way it’s impossible to reconstruct a book by only knowing it’s paragraphs contents (but not their order), borrowing the analogy from Wikipedia [8].

One thing I can think of is that if we repeat the experiments multiple times, each time cutting the molecules at different places, it might be possible to infer the relative order of the segments if they’re unique enough.

For example, if the original segment is: GATTCAGC and we run two experiments, one yielding (GAT, TCAGC) and another (GATTC, AGC), then in this case there’s only one way to assemble (order) these sequences in a consistent way.

Conclusion

In this post we studied some aspects of DNA sequencing. I found the Sanger method fascinating. In Computer Science ideas usually translate to algorithms and code very directly, but in other science branches the mapping from ideas to implementation is a problem in itself.

In Mechanics for example, we have to work with the physical world, so when converting from a circular movement to a linear one requires some clever tricks.

This needs to be taken to another level in Molecular Biology, because we don’t have direct access to the medium like we do in a mechanical device, for example, we can’t directly manipulate a double strand of DNA fragment to separate it, but have to resort to indirect ways.

The field of Biotechnology seems to be progressing at such a pace that it was challenging to find good sources of information. I’m yet to find a resource that explains end to end the steps from the process that takes a sample of cells and outputs the DNA nucleotides to a computer, including the technical challenges in doing so. This is what this post aimed to do.

References

[1] Wikipedia – Frederick Sanger
[2] Genohub – Illumina’s Latest Release: HiSeq 3000, 4000, NextSeq 550 and HiSeq X5
[3] Illumina Hiseq-X
[4] Youtube – Illumina Sequencing by Synthesis
[5] Khan Academy – DNA sequencing
[6] Science Magazine: A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides
[7] Towards Data Science – DNA Sequence Data Analysis – Starting off in Bioinformatics
[8] Wikipedia – Sequence assembly
[9] Dummies – How Scientists Cut DNA with Restriction Enzymes

Advertisements

Cell biology and programming

rosalind

Rosalind Franklin was an English chemist and X-ray crystallographer. She is best known for her work on the X-ray diffraction images of DNA, particularly Photo 51, while at King’s College, London, which led to the discovery of the DNA structure by Watson and Crick.

James Watson, Francis Crick and Maurice Wilkins shared the Nobel Prize in Physiology or Medicine in 1962. According to Wikipedia [1], Watson suggested that Franklin would have ideally been awarded a Nobel Prize in Chemistry four years after Franklin passed away due to ovary cancer.

photo_51

Photo 51

In this post we’ll study some basic concepts of cell biology, mostly around the DNA. We’ll start by introducing the structure of the DNA and then two of its main functions: replication and synthesis of protein. Since this is a programming blog, we’ll provide analogies (some forceful) to computer systems to help us relate to prior knowledge.

The end goal of this post to provide a good basis for later learning bio-informatics algorithms.

Genome

Genome is the set of information necessary for creating an organism. In a computer programming analogy we can think of the genome as the entire source code of an application.

Chromosome

Let’s recall that cells can be classified into two categories, eukaryotic (from the Greek, good + nucleus) and prokaryotic (before + nucleus). As the name suggests, eukaryotic cells have a well define nucleus surrounded by a membrane.

In eukaryotic cells the genome is divided across multiple chromosomes which are basically compacted DNA. They need to be compacted to fit within the nucleus. According to Science Focus [3], if stretched, the DNA chain would be 2 meters long.

Humans cells usually have 23 pairs of chromosomes, 22 of which are called autosomes and they are numbered based on size (autosome 1 is the largest). The remaining is a sex chromosome and can be of type either X or Y.

Men and women share the same types of autosomes, but females have two copies of chromosome X, and men one chromosome X and one Y.

Chromosomes are usually depicted as X-like structures and neatly arranged [4], but recent research was able to visualize the actual structure and it looks more like:

chromosome.jpg

Prokaryotes (e.g. bacteria) on the other hand typically store their entire genome within a single circular DNA chain.

In our computer programming analogy, the chromosome serve as units of organization of the source code, for example the files containing the code. If your application is simple, like a bacteria, the entire source code could be stored in a single file!

We can push the analogy even further and think of the tangled DNA within the chromosome as the “minification” step that JavaScript applications apply to the original source code to reduce network payload.

DNA

The deoxyribonucleic acid, more commonly known as DNA is a structure usually composed of two strands (chains) connected through steps to form a double helix.

dna-concept

Conceptual representation of the DNA: the double helix

In our analogy the DNA could represent the text of the source code.

Nucleotide

Nucleotides are the discrete units that form the base of the DNA. In the DNA it can be one of: Adenine, Cytosine, Guanine and Thymine.

Chemically speaking, we can divide the nucleotide in 3 parts: a sugar group, a phosphate group and a nitrogen base. The first two are common among the nucleotides, while the base differentiates them.

Screen Shot 2018-04-28 at 10.45.57 PM

Guanine, one of the 4 possible nucleotides in the DNA

Any two nucleotides can be connected through their sugar and phosphate groups (the sugar group of one nucleotide links to the phosphate group of the next). Nucleotides linked this way form the backbone of a single strand of the DNA.

In addition, two nucleotides can be linked together via their nitrogen base, but in this case, there’s a restriction on which two bases can be linked together. Adenine can only be paired with Thymine, and Cytosine can only be paired with Guanine. These pairings form the “steps” in between two DNA strands.

Screen Shot 2018-04-28 at 11.09.01 PM

4 nucleotides linked together through the sugar-phosphate groups or through the nitrogen bases.

Because the phosphate group of a nucleotide is linked to the sugar group of the next, we can define a direction for a chain of nucleotides. The endpoint that ends with the phosphate group has 5 carbon molecules and is called 5′ (read as five prime), while the sugar group has 3 and is called 3′ (three prime). The two strands in a DNA molecule are oriented in opposite directions, as depicted in the figure above.

We can now complete the analogy of computer programming by stating that nucleotides are the characters that compose the text of the source code (DNA). In our case, the alphabet contains 4 letters: A (Adenine), C (Cytosine), G (Guanine), T (Thymine).

The replication

The DNA is capable of replicating itself so that it can be passed down to new formed cells.

In high-level, the double strands of the DNA start separating and other structures start binding new nucleotides to each strand (templates) until both the strands are completely duplicated (and form a double strand again).

The separation of the strands is triggered by the protein helicase, and can happen at any point of the DNA and it might happen in many places at the same time. One way to visualize this is opening a zipper jacket from the middle and keep pushing it open in one direction.

While the strands are being separated, proteins called DNA polymerase starts scanning each strand and making a copy strand by adding nucleotides to it.

The DNA polymerase can only extend an existing chain of nucleotide, so it requires an initial fragment to start with. For that reason, in the beginning of the duplication process, a small fragment of DNA or RNA, called primer, needs to be connected to the strand.

One important limitation the polymerase has is that it can only process a strand in one specific direction: from 3′ to 5′. But since we saw that strands are oriented in opposite direction of each other, it means that the replication process doesn’t happen symmetrically on both strands.

For the strand oriented 3′ to 5′ the replication is smooth, generating a continuous strand. For the reverse strand though, it will be done in pieces, because the polymerase is adding nucleotides in the opposite side of where the opening is happening. These pieces are known as Okazaki fragments and later joined together by another protein called ligase.

We can see these two cases taking place in the picture below:

dna-rep.png

One interesting fact about this process is that errors do happen and there are error corrections in place to minimize them. The result of the replication is also not 100% accurate, especially at the endpoints of the strands, where each new replica formed has its endpoints shorter than the original template. To compensate for this, the DNA has repeated redundant segments of nucleotides at the endpoints, know as telomeres, but eventually this extra segments get wore off to a point they cannot be used for replication. The shortening of telomeres is associated with aging.

In our analogy to computer programming, we could imagine the replication being the distribution of copies of the source code to other people. This analogy is weak, though. If we want to be precise, we’d need to come up with a program that is capable of printing its own source as output. This was named Quine by Douglas Hofstadter in Gödel, Escher, Bach [6] and it’s an example of a self-replicating automata, also studied by the famous computer scientist John von Neumann.

Protein Production

A second function of the DNA is the production of proteins. The first step, called transcription, is very similar to replication: One of the strands serve as template, and a new strand with complementary base is generated. The difference is that instead of Thymine, the nucleotide Uracil is used. The resulting structure is called mRNA, short for messenger RNA.

MRNA-interaction

Production of mRNA and its exit to the cytoplasm

The mRNA detaches itself from the DNA strand and exits the nucleus to the cytoplasm where the second step, translation, begins. In there, it is “interpreted” by a structure called ribosome. Every 3 nucleotides, denominated codon, translates to an amino acid, which in turn form a protein. The mapping of every possible 64 combinations of codons are displayed below:

dna-152136_640.png

Mapping of codons to amino-acids. For example, GGA maps to Glycine.

The tRNA, short for transfer RNA, is a small chain of RNA that connects amino-acids to their corresponding codon in the mRNA. There are some special codons that indicate the end of the process. The output of this translation will be a peptide chain.

translation

Synthesis of a protein

The production of protein is the actual functioning of the DNA. If we link to the computer programming model, we could think of the source code (DNA) being interpreted or compiled into an actual program that can be executed. It’s incredible how biological systems evolved to define an explicit code for the end of the translation, much like how a semi-color or new line often indicates the end of an expression.

Applications of Bio-informatics

How can computer science help with the study of biological systems? Computers are  good at repetitive tasks and handling large amounts of data. Here are some applications according to Wikipedia [7].

  • DNA sequencing – consists of determining the order of nucleotides in the DNA from raw data. Computers are useful here because the raw data often come as fragments that need to be merged.
  • Protein structure prediction – given a sequence of nucleotides, determine the chain of amino-acids is well-understood, but predicting the structure of the final protein is an open problem.
  • Reduce noise in massive datasets output by experiments.

Conclusion

From reading a lot of introductory material, I felt that there were a lot of  imprecise or ambiguous descriptions. That seems to come from the sheer complexity of biological systems and most results coming from empirical evidence, which currently only provide an incomplete picture of the whole, but new information still comes at a very frequent pace, sometimes making existing models obsolete or incorrect.

I last studied Biology back in high school. I don’t intend to study it in depth, but just enough to understand how computer science can be used to solve particular problems.

The idea of modeling living organisms as organic computers is fascinating. As we discover more about the inner workings of cells, I hope we can come up with better models and be able to predict their outcome with greater accuracy.

References

[1] Rosalind Franklin – Wikipedia
[2] Photo 51 – Wikipedia
[3] How long is your DNA?
[4] How many chromosomes do people have?
[5] What a chromosome really looks like
[6] Gödel, Escher, Bach: An Eternal Golden Braid – Douglas R. Hofstadter
[7] Bioinformatics – Wikipedia