Structure Encoding in DNA — What is the Junk DNA used for?

Antony Van der Mude
Published in Sharing Science
8 min read, Jul 18, 2021


To this day, there are people who consider most of the genome to be “Junk DNA”.

There are believed to be about 25,000 genes in the human genome that code for proteins. A gene is a sequence of codons. Each codon represents an amino acid. A protein is made up of amino acids and does the basic work of the cell.
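As a rough illustration of the codon idea, here is a minimal Python sketch that reads a DNA string three letters at a time and looks up each codon. The table holds only a handful of the 64 real codons, and the example sequence and the function name `translate` are made up for this illustration.

```python
# Partial codon table: only a few of the 64 codons, enough for the demo.
CODON_TABLE = {
    "ATG": "Met",   # also the usual start codon
    "TTT": "Phe",
    "GGC": "Gly",
    "GCC": "Ala",
    "AAA": "Lys",
    "TGG": "Trp",
    "TAA": "STOP",
}


def translate(dna: str) -> list[str]:
    """Read the coding strand three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


# Hypothetical seven-codon "gene" used only for illustration.
print(translate("ATGTTTGGCGCCAAATGGTAA"))
```

Every letter matters in this kind of encoding: change one letter and the codon may name a different amino acid.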

But this is only about 1% of the genome. There are other elements, such as introns, that help regulate the expression of genes. But the rest of the genome, over 90%, does not code for proteins. Much of the rest consists of things such as transposable elements (transposons)[1] and long sequences of non-coding DNA that repeat over and over with slight variations. Transposons, also called “jumping genes”, are DNA sequences that change their position in the genome.

Now, take a look at the hard drive of a modern computer, or for that matter, a smartphone. Most of these devices have an operating system that is either based on Unix or on the Microsoft Windows operating system.

The operating system is made up of thousands of programs, each of which is ultimately a sequence of machine code instructions (assembly language is the human-readable form of that code). Each instruction represents an operation that the computer should perform. An operating system program is made up of machine code instructions, and performs a basic function of the operating system.

But only a fraction of the computer memory is taken up by the operating system programs. There are also shell scripts that determine which operating system programs to run. Most of the memory is not used by the operating system at all. Much of the rest includes application programs and items important to the user, such as pictures and documents.

Map of a computer hard drive. The operating system is in the lower right. Note that the author’s CD library is stored here and lots of newspaper articles, but not many images!

To save memory, computers often compress data by reducing redundancy, such as expressing repeating elements by a shorter code. But this has problems if the medium is prone to errors. If the representation of the repeating elements is changed, this can make the whole data set unreadable. Uncompressed data does not have this problem. An uncompressed document or a picture can have a spot that has gotten garbled, but the rest of the data is still understandable.
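The trade-off can be tried out with a small Python sketch. It uses the standard `zlib` module as a stand-in for "compression" in general; the sample text and the helper names `flip_byte` and `survives_damage` are assumptions made for this illustration.

```python
import zlib


def flip_byte(data: bytes, index: int) -> bytes:
    """Return a copy of data with every bit of one byte inverted."""
    damaged = bytearray(data)
    damaged[index] ^= 0xFF
    return bytes(damaged)


def survives_damage(original: bytes) -> tuple[int, bool]:
    """Damage one byte in the raw and in the compressed form, then compare.

    Returns (number of wrong bytes in the damaged raw copy,
             whether the damaged compressed copy still decodes to the original).
    """
    raw_damaged = flip_byte(original, len(original) // 2)
    wrong_bytes = sum(a != b for a, b in zip(original, raw_damaged))

    packed = zlib.compress(original)
    packed_damaged = flip_byte(packed, len(packed) // 2)
    try:
        recovered = zlib.decompress(packed_damaged) == original
    except zlib.error:
        recovered = False
    return wrong_bytes, recovered


# Highly repetitive sample text, so it compresses very well.
text = b"the quick brown fox jumps over the lazy dog\n" * 50
print(len(text), len(zlib.compress(text)))
print(survives_damage(text))
```

In the raw copy the damage stays confined to a single character. In the compressed copy the same one-byte error means the data can no longer be decoded back to the original.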

But the uncompressed data, especially things like pictures, have sequences that repeat over and over with slight variations. Each line of the picture is repeated in the next line, with new things being added as you scan the picture.

In 2020, I published an article in the Journal of Theoretical Biology entitled “Structure Encoding in DNA” (purchase required, free preprint here) proposing that the purpose of the transposons and non-coding sequences, a major part of the “junk” DNA, is to specify body parts. This essay is a description of that article. Currently, this is just a hypothesis: it needs to be validated (or invalidated) by laboratory tests.

This is a non-technical article, so the references and illustrations are mostly from Wikipedia entries. For further information, check the references in the Wikipedia entries themselves — or the references in my paper.

Animals and plants are multi-cellular organisms. There must be a way to express how the parts are structured and how they fit together. The process of creating the structure is called morphogenesis[2]. Certain genes, such as the Hox genes[3], were shown to respond to substances called morphogens. Different concentrations of morphogens lead to differences in structure. A number of different Hox genes, for example, determine the different sections of the heart. But this is an analog, not digital, way of determining structure and lacks precision. One of the pioneers in the field of morphogens, Lewis Wolpert, along with Michel Kerszberg, published a paper in 2007 that acknowledged that morphogens are insufficient to define structure[4].

Morphogenetic molecules do exist, but it seems improbable that their concentration alone determines the fate of cells regarding their final position in the developing embryo… Thus, morphogens may represent a rather crude positional information system, which is then more finely tuned by cell-cell interactions. Clearly, the morphogen gradient does not act alone and is itself specified by a variety of complex cellular mechanisms.
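A toy model shows why a concentration readout is crude. The Python sketch below is not taken from the paper: the exponential gradient, the decay length, the two thresholds, and the fate labels "A", "B" and "C" are all assumptions chosen for illustration. Each cell in a row picks its fate by comparing the local morphogen concentration with the thresholds.

```python
import math


def cell_fates(source_level: float, n_cells: int = 10) -> list[str]:
    """Assign each cell in a row a fate from its local morphogen level."""
    decay_length = 3.0      # assumed: the gradient falls off over about 3 cells
    high, low = 0.5, 0.1    # assumed concentration thresholds
    fates = []
    for position in range(n_cells):
        concentration = source_level * math.exp(-position / decay_length)
        if concentration > high:
            fates.append("A")
        elif concentration > low:
            fates.append("B")
        else:
            fates.append("C")
    return fates


print("".join(cell_fates(1.0)))  # AAABBBBCCC
print("".join(cell_fates(0.9)))  # AABBBBBCCC
```

In this toy model, a 10% drop in the amount of morphogen at the source is enough to move the boundary between the first two regions by a whole cell. That is the kind of imprecision an analog signal brings with it.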

Homeobox (Hox) gene expression in Drosophila melanogaster. Image by PhiLip

The DNA must encode three-dimensional body part structures. The gross structure is specified by the Hox genes. But the fine structure must be encoded in the DNA as well. This encoding is more complex than a picture stored in a computer memory, which is two-dimensional. That is why the transposons, which were originally invaders, were pressed into use (exapted) by the cell to give more precision to the structure of multi-cellular organisms than morphogens alone would provide. The ability of these sequences to “jump” makes them useful for defining the three-dimensional structure of an organism from a single thread of DNA.

Further information is supplied by the long non-coding DNA. This data is not in a compressed form because the process of replication is error-prone and data compression does not tolerate errors well. These replication errors are also the source of the mutations that are fundamental to the process of evolution.

Transposon Structure and Mechanism of transposition. Image by Mariuswalter (CC BY-SA 4.0).

If this is the case, then transposons should be associated with Hox genes. And this is actually what happens. The sections of the DNA where the Hox genes are found also contain clusters of transposons. There may be a hierarchy: the Hox genes turn on the transposons, which then cascade down to other transposons for finer structure.

This process is controlled through methylation[5]. Methylation turns off the transposons. At any given time, most of the transposons in the cell are methylated. But the transposons that are active are demethylated. Also, since the transposons are actively manipulating the DNA, the repair mechanism of the cell is most likely turned off at certain times so the transposons can do their job.

Typical DNA methylation landscape in mammals. Image by Mariuswalter (CC BY-SA 4.0)

The transposons work in concert with the long non-coding DNA to perform two different types of functions. First, to provide the three-dimensional structure, cells are commanded to self-destruct (apoptosis). Second, of the cells that remain, different cells must perform different functions in a body part. This is done by turning individual genes on and off. This is controlled by the Gene Regulatory Network[6], the hierarchy of genes that determines the functioning of each individual cell.

Therefore the transposons are defining the detailed three-dimensional structures in the body. They give the outlines of the body part by defining the spaces and voids. For the cells that remain in the structure, they define the function of each cell by regulating which genes are turned on.

This is a totally different way of encoding data in DNA from the encoding of proteins. The biggest difference is that, to define a protein, the whole sequence of codons is important, but for an individual cell in a structure, all that is important is to determine where in that structure the cell is and what type of cell it should be. It is like the transposons are doing origami on the DNA, but instead of using a two-dimensional sheet, the transposon is doing origami on a spool of thread. This explains why it is so important that the transposon “jumps” from one place to another in the sequence, since it is determining a point in a three-dimensional structure from data expressed as a one-dimensional sequence.

Hierarchical organization of pluripotency gene regulatory network. CC BY-NC-ND 4.0

The transposons perform their work in the chromatin[7]. Chromatin has two forms: euchromatin, which is lightly packed, and heterochromatin[8], which is tightly packed. The euchromatin is where the genes are expressed, since expressing a gene requires access to the full sequence that codes for a protein. The heterochromatin needs only to look up the location of the cell in the overall structure, so it only needs to access the data unique to that structure. This information is passed to the histones in what is known as the histone code.

The operation of the Hox genes and the transposons is part of the process of epigenetics[9]. A fetal body part starts as a cluster of stem cells that can become any cell in that body part. As the cells divide, the Hox genes and transposons get turned on, which further and further differentiates the cells, leading to pluripotent cells and finally fully differentiated cells.

Epigenetic mechanisms. Image from National Institutes of Health — Public Domain

This process of differentiation requires that information be passed between cells to help determine their location in the structure. The morphogens provide gross information. Finer information is passed through exosomes[10] — capsules that contain many different molecules. Exosomes are known to carry both transposons and long noncoding RNA.

Exosome carrying a protein. Image by Guillaume Pelletier (CC BY-SA 4.0)

To repair body parts, a certain number of stem cells are held in reserve. Using the information passed in the local environment, the process of epigenetics can generate fully differentiated cells to replace the ones that die off.

So this is a hypothesis about how the three-dimensional structure of plants and animals is determined. Note that this does not account for all of the genome. There still can be a certain amount of the genome that is truly junk, just like a computer memory contains sections that hold leftover junk from deleted files and operating system processing. The genome contains non-functional transposons, for example. It is quite likely that, since these sections of DNA are not useful, they will eventually disappear from the genome in the normal process of evolution.

Reference and Footnotes

A. Van der Mude, “Structure Encoding in DNA”, Journal of Theoretical Biology, Vol. 492, 7 May 2020, 110205 https://doi.org/10.1016/j.jtbi.2020.110205

  1. https://en.wikipedia.org/wiki/Transposable_element
  2. https://en.wikipedia.org/wiki/Morphogenesis
  3. https://en.wikipedia.org/wiki/Hox_gene
  4. Kerszberg, Michel, and Lewis Wolpert. “Specifying positional information in the embryo: looking beyond morphogens.” Cell 130.2 (2007): 205–209. https://doi.org/10.1016/j.cell.2007.06.038
  5. https://en.wikipedia.org/wiki/DNA_methylation
  6. https://en.wikipedia.org/wiki/Gene_regulatory_network
  7. https://en.wikipedia.org/wiki/Chromatin
  8. https://en.wikipedia.org/wiki/Heterochromatin
  9. https://en.wikipedia.org/wiki/Epigenetics
  10. https://en.wikipedia.org/wiki/Exosome_(vesicle)

Antony Van der Mude
Sharing Science

Computer programmer, interested in philosophy and religious pantheism