Most DNA is not Junk
One of the things software developers hate most is estimating how much a piece of software will cost to develop. A question like that is hard to answer with any accuracy, but we have to answer it so that the bean counters can make business decisions. It is hard because there are a lot of unknowns, and we have to guess at those unknowns based on previous experience. Books have been written on estimating software development costs, but a simplified description of the process goes like this:
1) Break the development tasks up into subtasks.
2) Estimate how much it will cost to do these subtasks based on the costs of similar tasks completed previously.
3) Add the estimates of the subtasks to get an estimate of the total.
As you can see, there is a lot of guesswork involved. It’s not uncommon for such estimates to be off by more than a factor of two. Perhaps they should be called guesstimates.
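The three steps above can be sketched in a few lines of Python. This is only an illustration: the subtask names and costs are made up, and the "factor of two" range is just one simple way to show how far off such a total could be.

```python
# Hypothetical subtasks and costs in person-days (invented for illustration).
subtask_estimates = {
    "design schema": 3.0,
    "build API": 8.0,
    "write tests": 4.0,
}


def total_estimate(estimates):
    """Step 3: add the subtask estimates to get an estimate of the total."""
    return sum(estimates.values())


def plausible_range(total, error_factor=2.0):
    """Bounds on the total if it may be off by error_factor in either direction."""
    return total / error_factor, total * error_factor


total = total_estimate(subtask_estimates)
low, high = plausible_range(total)
print(f"estimate: {total} person-days, plausible range: {low} to {high}")
```

With these made-up numbers the point estimate is 15 person-days, but a factor-of-two error leaves anything from 7.5 to 30 on the table.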
In my blog entry “Junk DNA is a Myth” I spouted off about how it was ridiculous to think that 97% of our DNA is junk. I could believe 5% junk due to entropy, but not 97%. That blog entry was criticized by Professor Scott Page. In his criticism he never provided any proof that the vast majority of DNA is junk, just ridicule. This ridicule may have been a knee-jerk reaction to my blogging alias “Intelligent Designer”. Scott Page believes that anyone who is a creationist is an idiot.
In my defense I am going to take a stab at guesstimating a plausible amount of non-junk DNA in the human genome. I can already hear Scott laughing away in his office. I’ll have to make several assumptions to come up with this estimate. I’ll italicize these assumptions as I go along. I also plan to revise this estimate as I think it through, so please don’t quote me until I am done. I am publishing this draft because I am hoping someone with expert knowledge will stumble on it and chime in with useful information.
So let’s begin. In this estimate I will be using the word “information” to denote DNA that is not junk and “data” to denote DNA which may or may not be junk. I will also be talking about the data in terms of bytes and MBs [megabytes]. A nucleotide can be represented with two bits of data, a string of 4 nucleotides by a byte of data, and 4 million nucleotides by a MB of data. Thus the 3.2 billion base pairs of the human genome are equivalent to 800MB of data. Professor Page believes the human genome has only 24MB of information and that the rest is junk. That makes me laugh.
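For anyone who wants to check the conversion, here is a small sketch of the arithmetic. It assumes a decimal megabyte (1,000,000 bytes), which is what "4 million nucleotides per MB" implies; the function and constant names are my own.

```python
BITS_PER_NUCLEOTIDE = 2   # four possible bases fit in two bits
BITS_PER_BYTE = 8
BYTES_PER_MB = 1_000_000  # assumption: decimal megabyte


def base_pairs_to_mb(base_pairs):
    """Convert a count of base pairs into megabytes of raw data."""
    total_bits = base_pairs * BITS_PER_NUCLEOTIDE
    return total_bits / BITS_PER_BYTE / BYTES_PER_MB


human_genome_mb = base_pairs_to_mb(3_200_000_000)
bacterial_genome_mb = base_pairs_to_mb(4_000_000)

# Share of the data that 24MB of information would represent.
information_fraction = 24 / human_genome_mb

print(human_genome_mb, bacterial_genome_mb, information_fraction)
```

The 24MB figure works out to about 3% of the 800MB, which lines up with the claim that 97% is junk.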
According to Professor Larry Moran "a bacterial genome is about 4 million base pairs and there's no junk". By the conversion above, 4 million base pairs is 1MB of data, all of it information. So I think it is safe to say that there is at least 1MB of information in the human genome.
Now there are 210 known cell types in the human body. I’ll assume that each cell type requires at least 1MB of information. These cell types share a lot of common features, so I’ll assume there is a lot of common information. Just how much of the information is shared between these cell types is a guess. I am going to assume that 90% of the information in each cell type is shared and 10% is unique. This means that 210 cell types require 1MB + 209 * 0.1MB = 21.9MB of information. Rounding up, this implies that there is at least 22MB of information in the human genome.
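Since the 90% shared figure is a guess, it helps to see how much the result moves when that guess changes. The sketch below just restates the formula above; the function name and the alternative fractions are my own choices for illustration, not measured values.

```python
def min_information_mb(cell_types=210, mb_per_cell_type=1.0, unique_fraction=0.10):
    """Lower bound: one full cell type plus the unique part of each remaining type."""
    unique_mb_per_type = mb_per_cell_type * unique_fraction
    return mb_per_cell_type + (cell_types - 1) * unique_mb_per_type


baseline_mb = min_information_mb()
print(f"10% unique: about {round(baseline_mb)}MB")

# Assumed alternative guesses for the unique share, to show the sensitivity.
for fraction in (0.05, 0.20, 0.50):
    estimate = min_information_mb(unique_fraction=fraction)
    print(f"{fraction:.0%} unique: about {estimate:.1f}MB")
```

The estimate scales almost linearly with the unique share, so the 22MB figure is only as good as the 10% assumption behind it.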
But this is just the information needed to construct the different cell types. More information is needed for spatial orientation and to coordinate activity among cells to perform complex functions like vision, motor control, digestion and tissue repair. Since the most efficient comparison-based algorithms to just sort n objects have an order of n log(n), I am tempted to guesstimate by multiplying 22MB by log(210) to get a lower bound. That would be bad applied math and just plain lazy. Then again, I am not exactly getting paid to do this (wink).
I can think of two other approaches that could be taken. For one of them I need some data points. In particular I need size data about genomes of the simplest multicellular life forms that are well studied and believed not to have junk.
TO BE CONTINUED