Friday, February 01, 2008

Most DNA is not Junk

One of the things that software developers hate most is providing estimates for how much it will cost to develop a piece of software. It’s hard to answer a question like that with any accuracy, but we have to do it so that bean counters can make business decisions. The question is hard to answer because there are a lot of unknowns, and we have to make guesses about those unknowns based on previous experience. Books have been written on estimating software development costs, but a simplified description of the process goes like this:

1) Break the development tasks up into subtasks.
2) Estimate how much it will cost to do these subtasks based on the costs of similar tasks completed previously.
3) Add the estimates of the subtasks to get an estimate of the total.

As you can see, there is a lot of guesswork involved. It’s not uncommon for such estimates to be off by more than a factor of two. Perhaps they should be called guesstimates.
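The three steps can be sketched in a few lines of Python. This is only an illustration: the subtask names and the person-day figures are invented, and the factor-of-two error band is simply the rule of thumb mentioned above, not a measured value.

```python
# Hypothetical subtasks and costs in person-days, invented purely for illustration.
subtask_estimates = {
    "design": 5,
    "implementation": 20,
    "testing": 10,
}


def total_estimate(estimates):
    """Step 3: add the subtask estimates to get the total."""
    return sum(estimates.values())


def plausible_range(total, error_factor=2.0):
    """Bracket the total, assuming it may be off by `error_factor` either way."""
    return total / error_factor, total * error_factor


total = total_estimate(subtask_estimates)
low, high = plausible_range(total)
print(total, low, high)  # 35 17.5 70.0
```

The spread between the low and high figures is the point: the total is only as good as the guesses that went into it.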

In my blog entry Junk DNA is a Myth I spouted off about how it was ridiculous to think that 97% of our DNA is junk. I could believe 5% junk due to entropy but not 97%. This blog entry came under criticism by Professor Scott Page. In his criticism, he never provided any proof that the vast majority of DNA is junk, just ridicule. This ridicule may have been a knee-jerk reaction to my blogging alias “Intelligent Designer”. Scott Page believes that anyone who is a creationist is an idiot.

In my defense I am going to make a stab at guesstimating a plausible amount of non-junk DNA in the human genome. I can already hear Scott laughing away in his office now. I’ll have to make several assumptions to come up with this estimate. I’ll italicize these assumptions as I go along. I also plan to make revisions to this estimate as I think through it so please don’t quote me until I am done. I am publishing this draft because I am hoping someone with expert knowledge will stumble on it and chime in with useful information.

So let’s begin. In this estimate I will be using the word “information” to denote DNA that is not junk and “data” to denote DNA which may or may not be junk. I will also be talking about the data in terms of bytes and MBs [megabytes]. A nucleotide can be represented with two bits of data, a string of 4 nucleotides by a byte of data, and 4 million nucleotides by a MB of data. Thus 3.2 billion base pairs of the human genome is equivalent to 800 MB of data. Professor Page believes the human genome has only 24MB of information and that the rest is junk – that makes me laugh.
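As a quick check of the arithmetic, here is a minimal Python sketch of the conversion. It assumes exactly 2 bits per nucleotide and a decimal megabyte (1 MB = 1,000,000 bytes), which is what the figures above imply; the function name is made up for this example.

```python
# Assumptions: 2 bits per nucleotide, 8 bits per byte,
# and a decimal megabyte (1 MB = 1,000,000 bytes).
BITS_PER_NUCLEOTIDE = 2
BITS_PER_BYTE = 8
BYTES_PER_MB = 1_000_000


def nucleotides_to_mb(nucleotides):
    """Convert a nucleotide count to megabytes of raw data."""
    total_bits = nucleotides * BITS_PER_NUCLEOTIDE
    return total_bits / BITS_PER_BYTE / BYTES_PER_MB


human_genome_mb = nucleotides_to_mb(3_200_000_000)
print(human_genome_mb)  # 800.0
```

The same function gives 1.0 MB for a 4 million base pair genome.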

According to Professor Larry Moran "a bacterial genome is about 4 million base pairs and there's no junk". So I think it is safe to say that there is at least 1MB of information in the human genome.

Now there are 210 known cell types in the human body. I’ll assume that each cell type requires at least 1MB of information. These cell types share a lot of common features so I’ll assume there is a lot of common information. Just how much of the information is shared between these cell types is a guess. I am going to assume that 90% of the information in each cell type is shared and 10% is unique. This means that 210 cell types require 1MB + 209 * .1MB of information. Rounding this implies that there is at least 22MB of information in the human genome.
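The calculation is easy to reproduce, and writing it out shows how much the result depends on the guessed inputs. The sketch below just restates the assumptions from the paragraph above as named constants; none of them are measurements.

```python
# Assumed inputs from the estimate above (guesses, not measured values).
CELL_TYPES = 210
MB_PER_CELL_TYPE = 1.0
UNIQUE_FRACTION = 0.10  # share of each cell type's information that is not shared


def cell_type_information_mb(cell_types, mb_per_type, unique_fraction):
    """One full cell type plus the unique part of every additional type."""
    additional = (cell_types - 1) * mb_per_type * unique_fraction
    return mb_per_type + additional


estimate = cell_type_information_mb(CELL_TYPES, MB_PER_CELL_TYPE, UNIQUE_FRACTION)
print(round(estimate))  # 22
```

Changing the unique fraction from 0.10 to 0.50 moves the result to about 105.5 MB, so the final figure is only as reliable as that one guess.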

But this is just the information needed to construct the different cell types. More information is needed for spatial orientation and to coordinate activity among cells to perform complex functions like vision, motor control, digestion and tissue repair. Since the most efficient algorithms to just sort n objects have an order of n log(n) I am tempted to guesstimate by multiplying 22MB by log(210) to get a lower bound. But that would be bad applied math and just plain lazy. But then again I am not exactly getting paid to do this (wink).

I can think of two other approaches that could be taken. For one of them I need some data points. In particular I need size data about genomes of the simplest multicellular life forms that are well studied and believed not to have junk.



At 12:35 PM, February 19, 2008, Blogger Doppelganger said...

Funny stuff...

You failed utterly to define your terms and also failed to get the point of my post. Poor analytical skills, I guess.

The main point of my posts was to demonstrate the ignorance-based hubris exhibited by creationist engineer-types.

You were a big help in that arena, and continue to be.

I will be having fun demolishing this stuff.

At 9:49 AM, February 20, 2008, Blogger Chris Harrison said...

"A bacterial genome has 4 million base pairs of DNA"

Just FYI, bacterial genomes range 20 fold in size (from 450kb to over 9Mb).

With respect to eukaryotes, genome size evolves irrespective of organismal complexity or gene number.

This is easily observed for the onion genus Allium, in which genome sizes range from 7 pg to 31.5 pg (humans are at 3.5 pg, for reference). One species of Allium is certainly not 4.5 times as complex as another.

Biologists aren't saying DNA is junk out of their own personal ignorance. Rather, the mechanisms responsible for the generation of "junk" are well known and this knowledge is what forms the basis for the conclusion that much of eukaryotic genomes are unnecessary artifacts of selfish DNA and inactivating mutations.

This is what the human genome looks like, structurally:

Don't take this comment as "you're an idiot software designer". I'm interested in your upcoming posts, but bear in mind that biologists have good reasons to say that junk DNA exists in large quantities.

At 9:20 PM, February 23, 2008, Blogger Intelligent Designer said...

Hi Chris,

Thanks for visiting my blog. Do you think it is fair that I use the 1MB approximation for a bacterial genome for the purpose of this estimate? If not, what would you use?

What do you think is the most compelling reason to assume that 97% of the human genome is junk?

Actually many biologists are saying that the vast majority of DNA is junk out of ignorance. And when I say that I am not being derogatory in any way. I am saying this because a human genome wasn't even completely mapped until recently and the tedious and expensive effort to understand the information collected so far has just begun.

At 8:52 AM, February 24, 2008, Anonymous Boo said...

As you stated, you are a software developer, not a biologist. Yet somehow you feel more qualified to speak on this subject than actual working biologists, and believe that you can simply take your expertise in computers and plop it right into biology. It's an interesting principle. I myself work with schizophrenics, and know little about computer programming. However, applying your principle that I could translate knowledge in my field to someone else's field just cause, may I suggest you try shoving Haldol tablets in your computer's drive the next time you have a problem with it?

At 9:55 AM, February 24, 2008, Blogger Smokey said...

"...there are a lot of unknowns and we have to make guesses about those unknowns based on previous experience."

But there's a lot known about the human genome, the sequences are in the public domain, the software tools to analyze them are in the public domain, but you ignore all the available data. What are you afraid of?

"In his criticism, he never provided any proof that the vast majority of DNA is junk, just ridicule."

The evidence is in the sequence data that you ignore. Real scientists are clear about "junk" being a provisional category based on a negative criterion (the Wikipedia entry is not a good source, btw).

"I’ll have to make several assumptions to come up with this estimate. "

Why are you afraid of using the data? It's all available.

"Thus 3.2 billion base pairs of the human genome is equivalent to 800 MB of data. Professor Page believes the human genome has only 24MB of information and that the rest is junk – that makes me laugh."

Yet his estimate is based on data, while you avoid all data. It seems that yours is the laughable position.

"Now there are 210 known cell types in the human body. I’ll assume that each cell type requires at least 1MB of information."

Why would you assume that? Why not look at real data?

"But this is just the information needed to construct the different cell types. More information is needed for spatial orientation and to coordinate activity among cells to perform complex functions like vision, motor control,..."

Are the complex neuronal connections involved in visual and motor pathways specified in the genome and hard-wired, or are they acquired during development using relatively simple algorithms and honed by experience?

Hint: there's a lot of data available that addresses this question. My hypothesis is that you are afraid to seek out and examine the data. Opinions aren't data, btw.

At 10:04 AM, February 24, 2008, Blogger Niels said...

You're making a fundamental mistake in assuming that because a genome consists of information resembling that of a computer program, then it has to work like a computer program too. There is a very big difference between the two.

A computer program is basically made up of instructions and data that is executed by the CPU and always the CPU. None of the instructions can do anything on their own.

In a cell every protein created from the DNA blueprint can, in theory, perform tasks independently of any other part of the cell and usually a single protein can perform many different tasks. And while the proteins themselves can be relatively simple (corresponding to perhaps a few kb's of data each), when interacting with each other in different ways, they can perform countless different tasks in the cell and on the organism as a whole. All of these tasks, if they were to be performed by a computer, would probably require mindblowing amounts of instructions and data, but since the cell doesn't work by executing instructions, the amount of "data" needed in the DNA is very small by comparison.

Every organism is made by a relatively small set of tools working together in a unique way. There's no reason to assume that huge amounts of functional DNA are necessary to achieve this.


At 10:11 AM, February 24, 2008, Blogger Chris Harrison said...

Randy, I'm not even sure why, in a post about the human genome, you were going on about bacterial DNA content. Why not just stick with the organism you're focusing on?

We know more than 3% of the human genome is functional; I don't know where you're getting this "97% is junk" from, but it's not an accurate reflection of current knowledge. That said, the term "junk DNA" has been used in a number of different ways since its inception so whoever you got that quote from may be using the term in an odd way.

Just how much is functional is not known, but it's more than 3 %.

Your last paragraph is very wrong.
Here are two series of posts compiled (and mostly written) by Dr. Gregory that should interest you:

At 10:12 AM, February 24, 2008, Blogger Chris Harrison said...

PZ has some words for you here:

At 1:51 PM, February 24, 2008, Anonymous Anonymous said...

Since the most efficient algorithms to just sort n objects have an order of nlog(n) I am tempted to guesstimate by multiplying 22MB by log(210) to get a lower bound.

Um, what? Big-O notation describes time to execute. It has nothing to do with dataset size. Sorting a million things doesn't take more code than sorting 10 things.

Never mind that there are more efficient sorting algorithms. Have you never heard of pigeonhole sorts, bucket sorts, spaghetti sorts... My firm won't be hiring you anytime soon.

And that's just the problem with your comp sci.

Your understanding of biology is abysmal. Hint: ontogeny does not have a CPU.

If you have to work from analogy, consider that Conway's game of life creates surprisingly rich structures, and patterns out of 4 rules.
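To make the Game of Life point concrete, here is a minimal Python sketch. It is one common way to write the rules, not anything the commenter posted: the four rules (underpopulation, survival, overpopulation, reproduction) collapse into a single condition on the neighbour count, and the glider coordinates are the standard five-cell pattern.

```python
from collections import Counter


def life_step(live_cells):
    """Advance a set of (x, y) live cells by one generation."""
    # Count, for every cell on the grid, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next turn with exactly 3 neighbours,
    # or with exactly 2 if it is already alive.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }


# The standard five-cell glider.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
cells = glider
for _ in range(4):
    cells = life_step(cells)

# After four generations the same shape reappears one cell over diagonally.
print(cells == {(x + 1, y + 1) for (x, y) in glider})  # True
```

Nothing in the rules mentions a glider; the moving shape comes entirely from repeated local interactions, which is the sense in which a small rule set can produce rich structure.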

At 2:34 PM, February 24, 2008, Anonymous Douglas said...

Hi Randy,

As you're a software developer, you will appreciate this example. Here's a Hello World program, written in C:

#include <stdio.h>
int main()
{
    printf("Hello World\n");
    return (0);
}

By my count, it is represented by 79 bytes, and in a sense, it is obvious that it contains enough information to print "Hello World". On the other hand, it is obvious that it doesn't contain enough information, because there is no code to load the program into memory, no drivers to make the monitor display letters, or a thousand other things.

Another "Hello World" example, this time in Perl:

print "Hello World\n";

This is only 22 bytes long, but again it is clear that there is enough information to print "Hello World". It turns out that the "79 bytes" comment doesn't give you much information, unless you know how the data is encoded, how it is read, and how it is processed. In the case of DNA, unless you know the biology involved, the chemistry taking place and the underlying physics, comments about the number of bytes of information stored are equally meaningless.

With respect to junk DNA, here is another example right out of the computer science department. Say you know a programmer, and you know that he adds lots of extra comments to his code, and you also know that your compiler discards all of these comments when it compiles the code. In this case, you could call the comments "junk", as they don't affect how the program runs. You can make an estimate of the amount of junk simply by knowing the programmer's coding style, without reading the code itself. You can make similar estimates in biology with respect to DNA.

At 6:26 PM, February 24, 2008, Anonymous david said...

I can think of two other approaches that could be taken. For one of them I need some data points. In particular I need size data about genomes of the simplest multicellular life forms that are well studied and believed not to have junk.

You know, other people, maybe even evolutionary biologists, might have looked for the data before they started with the arguments from ignorance.

Take a look at the puffer fish genome for a start, 8x smaller than yours but still has ~80% of the genes you have. Most of the difference is that the introns, non-coding sections within genes, have been shortened in the Fugu genome. They are almost all in the same places within genes, suggesting their presence serves some purpose, but the fact Fugu lives happily with most of the actual sequence removed suggests they don't need to be as long as they are in humans and most other vertebrates.

Add to our bloated introns tonnes of ALU (~10% of your junkless genome is the same 300 nucleotides repeated) and LINE (21%) elements and you have a lot of DNA that's not doing a lot.

Like Chris I'm not calling you an idiot, there's no reason outsiders can't have an opinion on evolutionary biology - I just wish you had made your opinion a little better informed.

At 7:22 PM, February 24, 2008, Anonymous Anonymous said...

To preface my comment, I wish to apologize for looking so unofficial by not using any sort of recognizable username, but I simply don't desire to make a Google Blogging account. Also, I wish to make it known that I find your desire to approach this problem from your knowledge base to be admirable, though it is a very, very basic beginning to attempting an answer in a highly technical field which, hopefully you will admit, you share very little with - and what you do share is limited to some abstract conceptual methodological reasoning.

That said, I have several problems with your approach.

Firstly, let us at least be consistent and not sloppy if we're going to make an analogy to computing. Despite the fact that I don't know very much in the way of computer science, I do at least know that your numbers for the memory space human DNA would take up are inconsistent with computer memory. For instance, I can click on my harddrive's properties tab and find that it has quite a few more bytes than it does GB. I (vaguely) understand this to be due to the conversion upwards in magnitude in a non-base-10 (base-8?) system, but it is not at all a superficial difference in a technically stringent approach.
Thus, I think it necessary that you remain consistent in your base-4 figures with what a DNA-MB would be compared to a DNA-byte.

Secondly, I find your estimate about how much "information" human cells' DNA have in common VASTLY arbitrary. As my very, very basic understanding of biology informs me, all human cells originated as (embryonic) stem cells. Well, instead of taking some arbitrary value for your basic cell DNA size (i.e., the equivalent size of the genome of the simplest possible bacteria), why not ask around to find out the size of DNA in an embryonic stem cell. I'm certain that figure is available in some journal or other, and may even be google-able.
From there, it would be appropriate to ask developmental biologists (like PZ Myers - Pharyngula on ScienceBlogs) on the blogosphere about how much the DNA of a differentiated cell grows when it is changed from an embryonic cell. Or, if it is appropriate, how much it shrinks.
Now, obviously an estimate of this change (most rigorously speaking, based on an average of all the changes) would suffice, but it would still be good to base your assumptions about the human genome on the available knowledge about the human genome.
Also, in considering this change in DNA-length from undifferentiated to differentiated cell, it might be more appropriate to consider how much "code" (DNA) is required. That is, how long is the proverbial set of instructions that the bone marrow performs on an undifferentiated cell to make it a white blood cell?

Thirdly (and lastly, which you are probably very grateful to hear), I think it would be prudent to consider the fact that after you have your basic DNA for making all the human cells, you can't just magically run a sorting algorithm to build the body. You need some sort of hardware to run that software, don't you? That is, doesn't there need to be some sort of controlling gene (or set of controlling genes) included in our various DNA sequences to run a sorting algorithm? These genes, much like an operating system, would require some space, right?
To draw further anecdotal evidence for this question, I think it is important to consider the problem you're trying to solve. As you probably know from wiki-hopping about junk DNA, it is actually DNA whose purpose is not currently known. The two types of sequence which currently have unknown functions appear to be, broadly speaking, sequences which are space-filling and/or temporary genes. Thus, it might be prudent to consider them as some sort of space for encoding functions in the genome's operating system. For example, one might imagine a sequence whose purpose would simply be to sit there as an "else"-controlling gene, which would explain why it is apparently useless in the actual nitty-gritty computations used to build people.

Good luck with the project!

At 7:28 PM, February 24, 2008, Blogger Jim said...

"But this is just the information needed to construct the different cell types. More information is needed for spatial orientation"

You really need a primer in developmental biology. I don't even know where to start, here. I guess this example will do.

You know how some people have six fingers or six toes? By your logic, this must have been an absolutely tremendous mutation; each digit requires "1 MB" or whatever, and for another one to be tacked on, well, you've obviously "copied" a ton of data! Right?

Well... no. Wrong. Development is governed by the interactions of a number of generally highly conserved signalling molecules, which cells send and receive at various times. There's an amazing cascade of events that begins at fertilization that leads to one cell turning into trillions, but the truly amazing thing is that none of it is predetermined the way creationists often assume ("But what about the eye?!"). The sixth finger can result from something as minor as one transcription factor screwing up in one cell very early on.

Where I'm headed with this is that software is a terrible analogy. Code is relatively straightforward. A function is never going to do anything you don't let it do. Proteins, on the other hand, can cause or prevent the formation of other proteins, and generally do whatever the laws of physics allow them to do. If you have a group of a hundred thousand proteins, each of which can interact with every other protein, that's ten billion potential interactions. In actuality, given that protein complexes can involve virtually any number of proteins, into the dozens or hundreds, well, you're rapidly approaching virtually infinite complexity. And then when you get into things like differential transcription, you get meta-meta-meta levels of complexity. It's staggering, and far beyond anything software can do.

In your defense, "junk" DNA may introduce conformational changes that prevent or enable transcription, and other "auxiliary" type functions that are difficult to pin down with current technology. Hence the use of the term "noncoding" being preferable to "junk".

However, as Chris noted, there are several well-established and incontrovertible mechanisms by which "good" DNA can be turned into "junk", and by which "junk" can copy itself. You may not be a fan of Richard Dawkins, but his paradigm of the gene as the fundamental unit of heredity (and not the organism) would probably be eye-opening for you, as it explains why "junk" DNA is a perfectly viable strategy -- your DNA doesn't really care about you. You're just the latest easily replaced bodyguard for your genes.

At 9:53 PM, February 24, 2008, Anonymous Anonymous said...

I tell ya, you have your analogy wrong. The code for MS Word is not at all like DNA in an organism - it is the organism.

MS Word evolves over time (duh, versions) due to the pressure of natural selection (consumer demand for a better program). The cell would be Microsoft, with its various structures (including software designers) that consume energy to keep the organism alive. And that makes the 'DNA' for MS Word... Bill Gates.

How many megabytes is he?


At 11:38 PM, February 24, 2008, Blogger Intelligent Designer said...

Hi Chris,

That explains why my blog has had a massive number of hits today. Apparently PZ has a very popular blog. It turns out that I visit both Pharyngula and Germicron regularly.

To Everyone Else,

I have also received a number of comments, many of which contain blatant bigotry or are less than polite. If you have made such comments don't be surprised if I reject them. Try contributing some useful information like Chris instead.

At 11:58 PM, February 24, 2008, Anonymous Anonymous said...

In addition, Junk DNA is not simply junk. There are various useful aspects to having large segments of non-coding DNA. For example, the position of genes on the genome often has important uses such as timing or accessibility for transcription. Even though the DNA itself is "junk" in that it doesn't code for anything, it serves a purpose in terms of the genome as a whole.

At 1:37 AM, February 25, 2008, Anonymous Sandeep K said...

(In reply to your previous post)

Intelligent designer, that is bad reasoning. The Mandelbrot fractal is very complex but just see how much is required to encode it in a program. Or think about the amount of space required to *store* a program that can print out transcendental numbers.
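For a sense of scale, here is a rough Python sketch of such a program. It is an illustration rather than anything the commenter linked to: the function names, the viewing window and the iteration limit of 50 are all arbitrary choices, and the output is a coarse ASCII picture rather than a proper image.

```python
def in_mandelbrot(c, max_iter=50):
    """Return True if c has not escaped after max_iter steps of z -> z*z + c."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return False
    return True


def render(width=60, height=24):
    """Draw the region -2.0 <= x <= 0.6, -1.2 <= y <= 1.2 as ASCII art."""
    rows = []
    for row in range(height):
        y = -1.2 + 2.4 * row / (height - 1)
        line = ""
        for col in range(width):
            x = -2.0 + 2.6 * col / (width - 1)
            line += "#" if in_mandelbrot(complex(x, y)) else " "
        rows.append(line)
    return "\n".join(rows)


print(render())
```

The whole thing is a couple of dozen lines, and the detail of the picture is limited by the resolution and iteration count you choose, not by the length of the source.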

Besides, you can't just look only at the size of the program. The functionality provided by the execution environment can make a program arbitrarily large or small. Not just talking about libraries here, the VM (Virtual Machine) / CPU all matter.

So, please do read up more on this subject. You are just reiterating the same argument used by many before you. The human eye is so complex... then it's shown the human eye is actually not so well designed. What if you had never seen anything like a fractal and I show you one and tell you it can be produced by an extremely short program. You would say it's impossible. But only when you understand the mechanism with which it is produced you begin to realize that it is indeed possible. We may not understand all the mechanisms and interactions at every step with which the genetic code is able to produce the fantastic output, but we have just started tinkering with this new code! How fascinating it is that the code's output is now trying to act as a debugger and starting to analyze the program that produced it :). And if anyone finds and demonstrates that DNA does not have junk data, I'm sure they wouldn't hesitate to present their findings.

Maybe you will like this: DNA seen through the eyes of a coder:


