Saturday, May 02, 2009

The Probability of Information -- Part 2

In my first blog entry about the probability of information I used the English language to show that the probability of generating a significant amount of information randomly is essentially 0. It’s important to note that the example is not just an analogy; it illustrates a fact that applies to all kinds of information – including DNA. The calculation may vary depending on the type of information but the end result will always be the same. I only chose English because it is a type of information we can all understand. However, in this blog entry I want to talk about the properties of DNA that make it so improbable that a randomly generated sequence of DNA would represent a life form.

Coding segments of DNA consist of a sequence of codons (nucleotide triplets). There are 64 (4 nucleotide types to the third power) possible codon types. Of these codon types there is one start codon and 3 stop codons. All coding segments of DNA begin with a start codon and end when the first stop codon is encountered. There may be more start codons between the initial start codon and the stop codon but they don’t matter. The length of a coding sequence is determined by the distance between the first start codon and the first stop codon. Given that information, what would be the average length of a randomly generated coding sequence? It would be 1 + 64/3 ≈ 22.333 codons. The 1 is for the start codon and the 64/3 is the average number of codons drawn up to and including the first stop codon, since each random codon has a 3 in 64 chance of being a stop.
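The average above can be checked with a short simulation. This is only a sketch in Python, not the Silverlight application's source; the function names are made up for illustration, and codon values 0, 1 and 2 simply stand in for the three stop codons.

```python
import random

N_CODONS = 64  # 4 nucleotide types to the third power
N_STOP = 3     # stop codons among the 64


def random_sequence_length(rng):
    """Length in codons of one random coding sequence.

    Counts the start codon and the terminating stop codon.
    """
    length = 1  # the start codon
    while True:
        length += 1
        # Values 0..2 stand in for the three stop codons (an arbitrary choice).
        if rng.randrange(N_CODONS) < N_STOP:
            return length


def average_length(trials, seed=0):
    """Mean length over `trials` independently generated sequences."""
    rng = random.Random(seed)
    total = sum(random_sequence_length(rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    print("simulated:", average_length(100_000))
    print("expected: ", 1 + N_CODONS / N_STOP)
```

With enough trials the simulated mean should settle close to 1 + 64/3.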

Above is a Silverlight application that I wrote to generate codon sequences. It uses the 64 characters below to represent the 64 codon types. A is used for the start codon and . ? ! are used for the stop codons.

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
123456789.?!

Click the Generate Codon Sequence button. It will generate a random codon sequence and display it. Click it several times and note that it will keep track of the length of the longest codon sequence generated and compute the average length of all the codon sequences generated. Try clicking this button for a few minutes and see if a codon sequence longer than 300 pops up. If you get tired, click the button to generate 1 million codon sequences. The application will display the longest codon sequence generated and its length.

Human DNA contains thousands of coding sequences. The average length of a human coding sequence is about 3000 bases, or approximately 1000 codons. The longest coding sequence in human DNA is over 2.4 million bases or 800,000 codons. Click the Generate 1 Million Sequences button until you get one over 500. Well, on second thought, that would be a waste of time. Click the Generate Codon Sequences Perpetually button and let the program run. Come back tomorrow and look to see if you get one.

What are the chances that a randomly generated coding sequence would be longer than 1000 codons? About 1 in (64/61)^1000, that is, about one in 7.1e+20. So what is the probability that a codon sequence could be generated randomly that would be of use to a life form? It’s really impossible to calculate. We have shown that the probability of randomly generating an average-length coding sequence is low, and we haven't even begun to talk about what properties would need to be satisfied for the sequence to be useful for a viable life form. If we use the example in my first blog entry on the Probability of Information as a basis for comparison, it’s like we haven't even checked to see if property 1 is satisfied yet.
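The 1 in (64/61)^1000 figure follows from requiring that none of 1000 consecutive random codons is a stop codon, each with probability 61/64. A minimal Python sketch of that arithmetic (the function name is hypothetical):

```python
P_NOT_STOP = 61 / 64  # chance that one random codon is not a stop codon


def prob_longer_than(n_codons):
    """Probability that n_codons random codons in a row contain no stop codon."""
    return P_NOT_STOP ** n_codons


odds = 1 / prob_longer_than(1000)
print(f"about 1 in {odds:.1e}")
```

The same function can be used for other thresholds, for example `prob_longer_than(300)` for the 300-codon challenge mentioned earlier.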

Someone is going to argue that natural selection can help us to generate a viable coding sequence. My question is how? Natural selection works on traits that give life forms an advantage in their environment. At a minimum, a trait would be a by-product of several genes working together to give a life form some advantage. In other words, natural selection can’t begin to work until after viable coding sequences already exist.

One of the annoying things about Silverlight is that it doesn’t support DoEvents like Windows applications do. The source code of this Silverlight program demonstrates how to simulate the DoEvents method in Silverlight using a Thread and Dispatcher. This allows the program to update the UI and process the STOP button while running in a tight loop to generate codon sequences perpetually.

CodonSequenceGenerator Source Code

REFERENCES

Human Genome Project

At 11:46 AM, May 05, 2009,  kezdro said...

Sigh - why do they always ignore me, or leave before I get a chance to get back to them? I do so try to present clear, on-point arguments, questions, and critiques. Oh well.

@IntelligentDesigner:
"This will be my last post tonight. I hope you all had took the opportunity to play with the Codon Sequence Generator on my last blog entry."
Meh - no real reason to play with it. It's basically a geometric distribution where P(success) = 3/64. Not much need to use a simulation for that, the mathematics of it are rather well known. You are correct that there is more to evolution and other relevant topics than you appear to be aware of. As such I will present several comments to improve the fidelity of your simulation, and am looking forward to seeing your updates to the generator in response. Do not mistake my comments for being the sum of evolutionary theory though - this is only a starting point.

I am unsure if your generator is meant to simulate the probability of successful abiogenesis or something more general, so I will separate the two topics.

Abiogenesis: In this case, your simulator is most severely flawed by assuming:
• That abiogenesis would require sequences equivalent in length to modern day gene sizes, and
• That abiogenesis uses the same methods and materials (protein/nucleotide/coding-wise) as modern organisms.

Regarding the first element, you may want to look at (1) this and (2) this. Item (1) demonstrates replicators needing only about 76 nucleotides. Recall that there are 3 nucleotides (generally) in a codon, thus that is around 25.33.. 'codons'. Of course, they do not function as codons, which is why the second list element is important (see next ¶). Item (2) demonstrates a mechanism by which very small lengths of nucleotides can be beneficial, along with selection for longer strings of nucleotides (recall that the N in even a small body of water over long spans of time can be very, very large - many orders of magnitude above your million repetitions).

As far as the second list element, the same two papers are also demonstrative. Neither of them requires an external copier such as a ribosome, thus 'start' and 'stop' elements may be irrelevant. Further, the second paper demonstrates that there can be selection for replicators or copiers before the replicators or copiers exist.

In general: your simulation lacks the following elements:
• Mutation - your program simulates random generation of sequences, but fails to include sequential mutation. You have demonstrated that you are aware of this as a basic tenet of evolutionary theory, so I assume that was just an oversight.
• Non-coding sequences - You have noted before, in a flawed understanding of the applicable nature of randomness (note 1), that recombination incurs mutations (note 2). Such non-coding sequences can be relevant to those, and other, mutations.

Also worth mentioning is a consequence of those two elements: sequences can take random walks in length over generations, and are not limited to the initial length. This is easily demonstrated in the positive direction by considering gene x copied/transposed into the middle of gene y, where length(x) ≥ length(y).

I had originally not intended to mention selection yet, as that could take considerable effort to implement realistically/usefully in code (most obviously, because of the difficulty of simulating function and folding), but paper (2) shows an easy-to-implement differential survival mechanism. As we are obviously starting with sequences of small length, simply select for those with the longest lengths, as they would make such a micelle more stable, and thus more likely to survive and/or 'reproduce' (note 3)

Keep in mind, these comments are only a starting point, and do not cover everything.

1: As I replied before:
"Most mutations are the result of recombination and are therefore are not completely random."
That they are the result of recombination is not important - randomness only requires that an event can not be predicted from prior events. In this case, that a mutation results from recombination rather than, say, a point mutation is not relevant - the events measured are changes in sequence, not the methods by which they occur.
2: I assume you mean more than just chromosomal crossover events, which are hardly the extent of viable mutations.
3: You may wonder how this could relate to contemporary organisms. Consider that at least some sequences of those lengths are likely to be replicators or polymerases (due to high N, not high p). Also, such replicators and polymerases will likely produce long sequences at a faster rate than the surrounding environment, thus becoming the dominant long-length species.

cross-posted at IntelligentDesigner's blog: randystimpson.blogspot.com/2009/05/probabilityof-information-part-2.html
cross-posted at Pharyngula: scienceblogs.com/pharyngula/2009/05/open_thread_frog_vent_the_blas.php

Also, tag support here sucks. No blockquote, no lists (had to use bullet entities), and no superscript.

At 4:26 PM, May 05, 2009,  Intelligent Designer said...

Hello kezdro,

Thank you for your thoughtful comments. It will take me a while to digest your references and respond. I have also used up my play time for a while so don't expect a response soon. When I do respond it will likely be with another blog entry which I will link to from these comments. Expect a Silverlight application or two that simulates mutation of information to accompany future blog entries.

At 2:45 PM, May 13, 2009,  Lord Runolfr said...

I don't want to reinvent the wheel, so to speak, so I'm going to direct you to the summary from a rather lengthy discussion of probability and how it applies to the Theory of Evolution.

Link

I think that sums up the problems with your codon generator: it's assuming a level of randomness that doesn't necessarily exist, as well as making assumptions about the minimum length of a codon chain that you can't validate.

At 9:46 PM, May 13, 2009,  Intelligent Designer said...

Lord Runolfr,

I read the entire discussion on probability. There wasn't anything of substance there. The author demonstrates that he knows some basic probability and how it applies to the lottery, dice, and cards while making snide remarks about the intelligence and integrity of creationists. His argument on page 3 suffers from the same fallacy that Richard Dawkins' weasel program does -- that natural selection can be applied to a small amount of information. In the author's case of the dice example, he assumes that natural selection can apply to as little as three bits of information. As for "assuming a level of randomness that doesn't exist", please elaborate on the chemical properties of nucleotide sequences that would exclude some possible codon sequences – or provide a reference.

At 10:25 PM, May 13, 2009,  kezdro said...

Regarding the fallacy you mentioned - do you mean it makes an unstated assumption (natural selection can act on small sequences), or that the assumption is itself wrong (natural selection can not operate on small sequences)?

If it's the latter, could you please show me where it's been shown that natural selection can not operate on small sequences, and where that dividing line is? I've not seen any research on the matter.

(Also, I may be wrong but I thought the weasel program was meant as a demonstration of the concepts of mutation and selection - a programmed analogy, if you will - rather than evidence, so I'm not sure that it's really a fallacy in that sense)

At 11:41 PM, May 13, 2009,  Anonymous said...

Wow, you're the oldest 10 siblings? That must suck to be attached to 9 other people AND be an idiot.

At 4:47 AM, May 14, 2009,  Anonymous said...

The sound of reality constantly tapping you on the shoulder must get annoying after a while.

At 4:59 AM, May 14, 2009,  BobbyEarle said...

Hey, Stimpy, how hard is it to update Silverlight? Or is that something your intelligent designer is supposed to do for you?

At 11:58 AM, May 19, 2009,  Intelligent Designer said...

Kezdro,

Relating back to your first comment, I don't have plans to update the codon sequence generator. I am considering another program that would simulate random mutation of a genome and selection. Do you have suggestions for such a program that would prove something one way or the other to you? I would like to collaborate with someone with an opposing viewpoint from mine about the requirements of such a program.

At 12:39 PM, May 19, 2009,  Brock said...

Just dropping a link about actual science...

Origin of Life: Building an RNA world from simple chemicals

It's hard to believe people are still spinning their wheels over the issue of "random information" when thermodynamics and natural selection aren't random (and the latter is the exact opposite) and "information content" is in the eye of the beholder :p

At 9:42 AM, May 23, 2009,  Paul said...

I fully agree that with a one-off generation of codons, like you have done in your program, the odds against generating a long sequence are astronomical.

However...

What you are presenting here has nothing to do with evolution.

One of the very fundamental features of Genetics is that you inherit DNA from your parents.

What you have going on here is that each generation (new sequence generated) has to create its entire sequence from scratch!

In evolution, because you inherit your DNA from your parents, you don't have to generate it from scratch each generation.

Each generation keeps their parents' DNA and makes slight mutations to it. It is this that allows DNA to increase in length.

Let's try this experiment (if you use imperial measurements you can substitute inches for cm):

What you need:
A ball of Wool or string
A pair of scissors
A 6-sided die

Here is what you do:
1) Cut 20 random lengths of string (make them small to start off with, about 10 cm long)

2) Sort the lengths of string according to length

3) Take the top 5 longest pieces of string and discard the rest

4) Take the first of these longest lengths of string and roll the dice 3 times and do the following:

If you roll a 1 on the dice, then using the original piece of string as a guide, cut a new piece of string 1cm shorter.

If you roll a 6 on the dice, then using the original piece of string as a guide, cut a new piece of string 1cm longer.

If you roll anything else cut the new piece of string the same length as the original.

5) Repeat the last step three times for each of the longest 5 pieces of string.

6) Repeat steps 2, 3, 4 and 5 as many times as you like.

What do you notice happening?

Well, first of all, you will start off with many different lengths of string, but then after the first 3 or 4 times you go through this you end up with all the string about the same length.

But then after your 10th to 20th time through, you will notice that you now have string longer than your starting strings.

But how is this possible? You have the same chance of increasing the length of the string as you have of shortening it. If we apply statistics, then the lengths of string should occasionally get longer, but also occasionally get shorter.

What is occurring is two things:

1) Inheritance.
Because we are basing the new lengths of string off of the original lengths of string we are not starting from scratch each time we cycle through it (each generation).

2) Selection.
Because we are selecting only the longest pieces of string, when we have a "mutation" that causes a new piece of string to become shorter it will get eliminated from the set of strings in one or two cycles (generations) of the experiment. But as there are still strings that are getting longer occasionally, and these are the ones that end up being the basis for the new generations the length of the strings increases over time.

But one argument is that getting shorter or longer has an equal chance. Well, let's see how an unequal chance would do:

Instead of cutting a piece of string shorter on a 1, now do the experiment again, but cut it shorter on anything but a 6 (and still have the string cut longer on a 6).

Now we have 5 times the chance of getting a shorter string than getting a longer string. It takes a bit longer for it to occur, but if you persist, you still get strings getting longer and longer over time and not shorter (this time it will take around 50 "generations" for you to see anything really occurring - but they still get longer).

In other words: If we consider the length of string to be analogous to the length of DNA, then if there is a selection for longer lengths of DNA then it can and will get longer over time.

But it also works for the content, not just the length (and you can also make it shorter too). It all depends on what is being selected for.
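The string experiment above can be sketched in a few lines of Python. The steps leave some details open, so this sketch makes assumptions that are not in the description: starting lengths are whole centimetres drawn uniformly from 5 to 10, and the 5 survivors are kept alongside their 3 copies each (giving 20 strings per round, so the longest length can never decrease). The function name and parameters are hypothetical.

```python
import random


def run_string_experiment(generations, shorten_faces=(1,), seed=0):
    """Return the longest string length (cm) after the given number of rounds.

    shorten_faces lists the die faces that make a copy 1 cm shorter;
    a 6 always makes the copy 1 cm longer.
    """
    rng = random.Random(seed)
    # Step 1: cut 20 random starting lengths (assumed 5..10 cm).
    strings = [rng.randint(5, 10) for _ in range(20)]
    for _ in range(generations):
        # Steps 2 and 3: sort by length and keep the 5 longest.
        survivors = sorted(strings, reverse=True)[:5]
        strings = list(survivors)  # assumption: originals are retained
        # Steps 4 and 5: cut 3 copies of each survivor, guided by a die roll.
        for parent in survivors:
            for _ in range(3):
                roll = rng.randint(1, 6)
                if roll == 6:
                    child = parent + 1
                elif roll in shorten_faces:
                    child = max(parent - 1, 0)
                else:
                    child = parent
                strings.append(child)
    return max(strings)


if __name__ == "__main__":
    print("fair die:  ", run_string_experiment(50))
    print("biased die:", run_string_experiment(50, shorten_faces=(1, 2, 3, 4, 5)))
```

If the originals were discarded instead of retained, the biased case would behave differently, so the result here should be read only as an illustration of inheritance plus selection under these particular assumptions.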

At 11:08 PM, May 26, 2009,  Intelligent Designer said...

Hi Paul,

You are right. I haven’t talked about evolution yet; I have only been talking about the probability of information. However, your exercise with dice and string doesn’t apply because string doesn’t have the same properties as codon sequences. If we want to prove that codon sequences can get longer we need to simulate mutation and selection on codon sequences. Would you like to propose mutation and selection criteria for doing this?

I plan to write a program to perform simulation of random mutation and selection on an artificial genome to see if we can evolve a genome that looks human in respect to the average length of coding sequences. If we can evolve a set of long coding sequences this would only satisfy a necessary condition for evolution (by random mutation and selection) and is by no means sufficient. We won’t have said anything about what is going on in the middle of the sequence that makes it useful. We will have only specified that there be a start and a stop codon.

If we can’t evolve these long sequences the simulation will strongly imply that evolution by random mutation and natural selection can’t occur. Naturally this still leaves room for the possibility of evolution by intelligent design.

At 5:18 AM, May 27, 2009,  Paul W said...

There is nothing actually "special" about the ability of a codon as an information storage/container. The only thing special (in regards to a codon) is that it interfaces with the molecular machinery in cells so as to allow them to manufacture proteins from the amino acids.

What is important is the process that is going on.

The exact properties are not important, so long as the correct "algorithm" can be "run" by the system, whether the system is pieces of string and dice, bits in a computer program or DNA. (In fact they are making computers that use DNA, which proves that DNA can run algorithms and so can form a Universal Turing Machine - a computer, for those that don't know what that is.)

Quote: IntelligentDesigner
"I plan to write a program to perform simulation of random mutation and selection on an artificial genome to see if we can evolve a genome that looks human in respect to the average length of coding sequences."

I have done fairly extensive experimentation with Genetic Algorithms (using computer algorithms to implement the procedure of Evolution), and I know that this will work because I have done it and it did work.

I planned to include the algorithms in my last post but the length was too long and so I had to remove it. If there is space in this post I will post the algorithm, but if not, I'll post a second entry specifically for that.

Looking at the String example, you will perhaps see that one of the basic features of the algorithm is that any string that is equal to or longer than the longest string will be retained, whereas if the string is shorter, then it will be removed.

If variation occurs in the length of the string (the next string derived from the original is either slightly longer or shorter than the original) then it is obvious that the strings must get longer.

Now, how does this apply to a sequence of codons?

Well in the string, the longer lengths were retained and the shorter were removed. If we also do this to the sequence of codons, then they too will increase in length (it is the same algorithm, just that we are using a sequence of codons, rather than a sequence of string fibres).

If you look at the string example again, you will also see where I had the chance of the string being cut shorter at 5 times that of it increasing, and yet the length of the string will still increase.

We are applying the same algorithm of removing the shorter ones (so it does not matter what chance this has of occurring as when it does occur it will be eliminated) and keeping the longer ones (again, it does not matter what chance this has of occurring as when it does we keep it).

As I said, evolution is not a random process. The variation might be random, but this has no impact on whether evolution occurs or not (you still need variation, but the variation does not have to be random for it to work).

It is why your whole premise is actually a misunderstanding of evolution. You can't apply the statistics of chance or analyse it as if it is like rolling dice as Evolution is not like that at all.

As the Human genome did not come into existence as a single event, but instead evolved over a long time span, then trying to prove that the odds of randomly creating a human like (length) genome sequence by using randomness, like you have, is actually incorrect. It is the most common mistake about evolution.

Evolution is not random, so you can't apply that kind of statistical analysis to it. Instead, you have to apply algorithmic analysis to it (and as a programmer you obviously know how to do this).

At 6:06 AM, May 27, 2009,  Paul W said...

The evolution algorithm:

Ok, as I said I would post the algorithm for evolving DNA like strands that show that length can increase.

The first thing we have to do is identify what we are using as the selection criteria.

Obviously, as we are interested in length, this will be our selection criterion. So we will use the length from the first Start Codon to the first Stop Codon.

Secondly we have to know what kinds of variation will be occurring in the algorithm.

In DNA there are several types of mutation that can occur:

Point Mutation: This is where a single base pair is changed to a different base pair. In our algorithm this will be done by randomly choosing a codon and changing it to a random value.

Insertions: This is where 1 or more base pairs are inserted for various reasons into a codon sequence. This can occur due to the replication machinery in the cell having an error (due to chemical interference) and inserting extra base pairs.

These can include sequences of a single base pair, right up to whole genes (what this codon sequence is supposed to represent).

To simulate this we will take a random point in the genome as the start of the duplication string and then move a random number of codons along from there as the end of the copied sequence and copy that entire sequence and insert it between the end of that sequence and what would have been the next codon after that sequence.

Deletions:

These are also encountered in DNA, where whole sequences are removed from the gene being copied. So we will apply a similar algorithm to the insertions, but we will delete the selected sequence instead of copying it.

So, now we have the essential parts needed to construct our Genetic Algorithm:

Selection: Longest Sequence from the first Start codon to the first stop codon.

Variations
1) Point Mutation: Change one randomly selected codon in the sequence to a random value

2) Insertions: Duplicate a randomly selected section of the sequence where the probability of the length being given decreases exponentially with the length.

3) Deletions: Remove a randomly selected sequence where the probability of the length increases linearly with the length.

Lastly, We need to work with a population of entities (codon sequences). From my own experiences, a population of around 10,000 would be a good number to see the effects of this, but we should use as many as will fit into memory.

Now the algorithm is really simple:

1) Generate an initial, random population (this would include length, as well as the values of the sequence). The algorithm you have used in the initial post is exactly what we would need; just remember you have to store all of the sequences in an array.

2) From that array, select the sequences that have the greatest length between the first start codon and the first stop codon.

3) From these selected sequences generate new sequences (keeping the original sequences) but apply the mutations to them randomly as they are created (and you can apply a mutation more than once, and more than one type of mutation, to any one copy).

4) Report statistics; include both the total length of a sequence as well as the length from the first start codon to the first stop codon.

5) Repeat steps 2, 3 and 4 until either the user stops the program, or a predetermined threshold is reached (say a length from the first start codon to the first stop codon of 900 codons). The reason for this is to allow the program to stop at some point (but in real evolution it does not have this exit requirement - unless of course you count asteroids as a quit command ;D ).

And that is pretty much it. The coding is more complex (but I don't know Silverlight and you do - as well as being able to embed it).
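The steps above can be sketched in Python rather than Silverlight. This is only an illustration under stated assumptions, not the program either commenter wrote: the population is 200 instead of 10,000, the top 40 sequences survive each round, each child gets exactly one mutation, deletion lengths use the same geometric distribution as insertions (a simplification of the deletion rule described above), and a sequence with no start or no following stop scores 0. All names are hypothetical.

```python
import random

START, STOPS, N_CODONS = 0, {61, 62, 63}, 64  # arbitrary codon numbering


def coding_length(seq):
    """Codons from the first start through the first following stop, else 0."""
    try:
        begin = seq.index(START)
    except ValueError:
        return 0
    for i in range(begin + 1, len(seq)):
        if seq[i] in STOPS:
            return i - begin + 1
    return 0


def random_sequence(rng):
    """A start codon followed by random codons up to the first stop."""
    seq = [START]
    while seq[-1] not in STOPS:
        seq.append(rng.randrange(N_CODONS))
    return seq


def segment(rng, seq):
    """Random (begin, end) slice; longer slices are exponentially less likely."""
    length = 1
    while rng.random() < 0.5 and length < len(seq):
        length += 1
    begin = rng.randrange(len(seq) - length + 1)
    return begin, begin + length


def mutate(rng, seq):
    """Copy seq and apply one point mutation, duplication, or deletion."""
    child = list(seq)
    kind = rng.choice(("point", "insert", "delete"))
    if kind == "point":
        child[rng.randrange(len(child))] = rng.randrange(N_CODONS)
    elif kind == "insert":
        begin, end = segment(rng, child)
        child[end:end] = child[begin:end]  # duplicate the slice in place
    elif len(child) > 1:
        begin, end = segment(rng, child)
        del child[begin:end]
    return child or list(seq)  # never return an empty sequence


def evolve(generations, pop_size=200, keep=40, seed=0):
    """Return the best coding length seen at the start of each generation."""
    rng = random.Random(seed)
    population = [random_sequence(rng) for _ in range(pop_size)]
    best = []
    for _ in range(generations):
        population.sort(key=coding_length, reverse=True)
        best.append(coding_length(population[0]))
        survivors = population[:keep]
        children = [mutate(rng, rng.choice(survivors))
                    for _ in range(pop_size - keep)]
        population = survivors + children
    return best


if __name__ == "__main__":
    history = evolve(60)
    print("first generation best:", history[0])
    print("last generation best: ", history[-1])
```

Because the survivors are carried over unchanged, the best length in this sketch can only stay the same or grow; how fast it grows depends heavily on the assumed mutation mix and population size.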

At 4:09 PM, May 27, 2009,  Intelligent Designer said...

I like your requirements, but I would like to refine them so that they are more realistic. Of course we are limited on how real we can get because of computer memory limitations and the amount of programming I can commit to, but I think we can get closer to reality than what we have so far. First of all, we do need a stop command because we don’t actually have forever for evolution to occur. We are limited by the age of the earth. So I think we need to define a generation, a mutation rate, and a time limit.
We can think of the 10,000 codon sequences as a genome. Apply X random mutations to it according to a mutation rate to make a child genome. Do this Y times to the parent genome to produce Y offspring. Then select the child that has the genome with the longest average codon sequence length. Does this sound good to you? If so, maybe you can suggest a plausible mutation rate, a plausible number of offspring, and a plausible number of generations to execute before we stop.

At 2:21 AM, May 28, 2009,  Paul W said...

Quote IntelligentDesigner
"We can think of the 10,000 codon sequences as a genome."

The 10,000 sequences is the population. As evolution looks at the spread of alleles throughout a population, we need to have a population.

A quick google of mutation rates in humans gives an approximation of 2.5×10^-8 per base per generation.

Remember to be more accurate each of the "letters" in this simulation is a codon, which is 3 base pairs, so the number of bases for the calculation on mutations needs to be multiplied by 3.
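As a rough sketch of what that rate implies, the per-base figure can be converted to a per-codon rate and multiplied out. Multiplying by 3 is an approximation that is only reasonable because the rate is tiny; the function name and the example sizes (1000 codons, population of 10,000) are illustrative assumptions taken from figures mentioned elsewhere in this thread.

```python
PER_BASE_RATE = 2.5e-8                # mutations per base per generation
PER_CODON_RATE = 3 * PER_BASE_RATE    # a codon is 3 bases (approximation)


def expected_mutations(codons_per_sequence, population):
    """Expected number of mutated codons across the population per generation."""
    return PER_CODON_RATE * codons_per_sequence * population


# A population of 10,000 sequences of 1000 codons each:
print(expected_mutations(1000, 10_000))
```

At this rate most simulated generations would contain no mutation at all in a population of this size.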

As for stopping, either have it as a user input (a stop button) or stop when the length of the sequence has increased to at least 800 codons (as we are looking for a length of 1000, but starting from around 22 codons long, although this is determined by random creation).

Actually, here is a real challenge for the program. It is known that base pairs can self replicate if they undergo cycles in temperature (cold to hot and back to cold). This is used in devices called PCR machines.

We also know that base pairs can spontaneously be created through normal chemical reactions; not only that, they have been detected in nebulae throughout space (there is a characteristic spectral pattern that can identify them when using a spectrometer). Bases can also spontaneously polymerise into chains (form sequences), so codons can spontaneously form without any replicating machinery.

So, why don't we start with a single codon and see if it can grow from a single base pair, through evolution, to achieve a long codon sequence (I have done this at the level of base pairs and been successful).

If it can do this, then it proves that from a single codon length (and even a single base) that evolution can increase the length of a sequence of DNA.

Another thing to note is that we are not directly selecting for the total number of letters in the sequence, only the number of letters between the first start codon and the first stop codon, so some of the mutations will occur in places and ways that do not directly affect the length of what we are looking at. What will occur is that when the length of the critical sequence (the sequence between the first start codon and the first stop codon) is close to or equal to the total length of a sequence, then any mutation that does increase the total length will give an advantage to that sequence that can be taken advantage of by other mutations.

So by selecting for the length between the first start codon and the first stop codon we are indirectly selecting for the total length of the sequence.

At 12:34 AM, June 03, 2009,  Intelligent Designer said...

Paul, the issues you brought up about spontaneous generation and replication would add too much work to the algorithm should I try to simulate them, so I will ignore them for now. Also, the average gene length in the human genome is 1000 codons. The longest gene is 800,000 codons long. So, evolving only one gene to a length of 900 codons doesn’t really prove that a human or eukaryote-like genome could evolve by natural selection. At minimum, we need to show that natural selection can produce a genome with an average gene length of 1000 codons in a plausible amount of time.

I was hoping you would provide criteria for a time limit but you didn’t so I will take a crude stab at it. Let’s set the maximum number of generations to be one generation per hour for 4.5 billion years, that is, 4,500,000,000 x 365 x 24 = 39,420,000,000,000. I think that is generous on my part for a lot of reasons. Let me know if you think differently. 39.42 trillion is a big number. It would take years to crunch through that many generations on my computer. Luckily our mutation rate is infinitesimal so we can actually simulate thousands of generations of mutations with one cycle. I will explain this in detail on the blog entry that will host the application. I’ll also make the application configurable so that the user can have some control over the simulation.

Next I will get more specific about computing the length of insertions and deletions but I’ll give you a chance to suggest a formula first. I suggest we don't assume that all insertions are duplications.

At 7:17 AM, June 04, 2009,  Paul W said...

Bacteria can, on average in optimal circumstances, double their number in a couple of hours.

Humans have a generation time of around 20 - 25 years.

However, the length of the DNA does not govern how quickly an organism reproduces.

Single-celled organisms existed on Earth from at least 3,450,000,000 years ago until around 565,000,000 years ago, that is, for about 2,885,000,000 years before multicellular organisms evolved.

If we take the generation time as being 6 hours (and it could be much shorter than this), then we are looking at around 4,212,100,000,000 generations before multicellular organisms appeared.

So your estimate of 39 trillion is a little large; it would be closer to 4 trillion before multicellular organisms appeared.
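Both generation counts in this thread (one generation per hour for 4.5 billion years, and one per 6 hours for the single-celled period) come from the same arithmetic, sketched here in Python. The function name is hypothetical, and leap years are ignored just as in the figures quoted.

```python
HOURS_PER_YEAR = 365 * 24  # leap years ignored, as in both estimates


def generations(years, hours_per_generation):
    """Number of whole generations that fit in the given number of years."""
    return years * HOURS_PER_YEAR // hours_per_generation


hourly = generations(4_500_000_000, 1)
six_hourly = generations(3_450_000_000 - 565_000_000, 6)

print(f"{hourly:,}")      # 39,420,000,000,000
print(f"{six_hourly:,}")  # 4,212,100,000,000
```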

One thing you will have to remember, is population sizes. We will not be able to deal with 7 billion individuals in the population. Instead we will have a population of around 1,000. So any rate of change will be far less than what you would get in a population the current size of the human race. As several billion bacterial cells can exist in even something the size of a back yard swimming pool, when talking about the population sizes for the oceans of the world during the single cell evolution on Earth (and even today), we are talking absolutely astronomical population sizes, something we could not even hope to model properly on current computers.

Because of this, any rate of change will be at a minuscule fraction of what could have potentially occurred on Earth. However, we have one thing in our favour: We are aggressively selecting for size, rather than what would have occurred on the Earth during the single cell period when genome size would only be indirectly selected for due to the increasing complexity of other single celled organisms.

At 7:17 AM, June 04, 2009,  Paul W said...

The Human Genome is not the largest genome out there, but some of the largest bacterial genomes are around 10 million base pairs (whereas the human is around 3.4 billion base pairs).

But this is not really important. What we are looking at here is whether or not evolution is capable of increasing the size of a gene faster than a purely random generation (as in your first program). To increase the size of the sequence to 1000 in fewer iterations than it would take to randomly generate it with your first program is a success, as it directly answers the question: Can a gene sequence be increased in size by evolution in less time than it would take to generate that same length of sequence by pure random creation?

The second question is: can the rate of length increase be fast enough to occur within the time frame of life on Earth?

Also, it is important to remember, we are not directly selecting for the length of the whole sequence. We are only directly selecting for the length between the first start codon and the first stop codon. This will also answer the question of whether evolution can act indirectly on the contents of a genome sequence.
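
To make the selection target concrete, here is a rough sketch of how that length could be measured, using the character scheme from the blog entry (A for the start codon and . ? ! for the stop codons). The function name `coding_length` is hypothetical. Counting both the start and the stop codon, and returning 0 when no complete coding region exists, are assumptions on my part, not something we have agreed on.

```python
START = "A"    # start codon in the blog entry's 64-character scheme
STOPS = ".?!"  # the three stop codons in that scheme


def coding_length(sequence):
    """Length from the first start codon through the first stop codon after it.

    Assumption: both the start and the stop codon are counted, and a
    sequence with no start or no following stop scores 0.
    """
    start = sequence.find(START)
    if start == -1:
        return 0
    for i in range(start + 1, len(sequence)):
        if sequence[i] in STOPS:
            return i - start + 1
    return 0


# The second 'A' is ignored; the region runs from index 2 to the first '.'
print(coding_length("BQAcdA9.zz?"))  # 6
```

Everything before the first start codon and after the first stop codon has no effect on the score, which is why selection on it could only ever be indirect.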

I haven't done any research on, and can't recall any specifics about, how the lengths of insertions and deletions are determined (although I am sure that there would be such research being done), but I do know that they can range from a single base pair up to whole chromosomes (as in Down's Syndrome), and I would assume that deletions would be somewhat similar, if for no other reason than to make them at least similar for the sake of the experiment (as it would be hard for us to delete more than 100% of the current sequence :D ).

As for the chance, we could just go with a linear formula (the algorithm selects a random number between 1 and the length of the genome after the start of the insertion or deletion).

Or we could go with an inverse N chance (where N is the length of the inserted or deleted sequence).

So if N=1 gives a 100% chance (we have already determined that we are doing a mutation, so at least 1 entry will be mutated), then 2 has a 50% chance, 3 has a 33% chance, 4 has a 25% chance, and so on.

If you really want to test if evolution can increase the sizes of genomes, why not have the linear method for the deletions and the inverse N method for the insertions (so a much higher chance of a long deletion than a long insertion)?
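
Here is a minimal sketch of the two length rules, just to pin down what I mean. The function names are made up, and `max_len` stands for the length of the genome after the start of the insertion or deletion. For the inverse N rule I am assuming that "N has a 1/N chance" means the chance that the mutation is at least N long; that is one possible reading, not the only one.

```python
import random


def linear_length(max_len, rng=random):
    """Linear rule: every length from 1 to max_len is equally likely."""
    return rng.randint(1, max_len)


def inverse_n_length(max_len, rng=random):
    """Inverse N rule, read as P(length >= n) = 1/n, capped at max_len.

    Assumption: the 100%, 50%, 33%, 25% figures are the chance of reaching
    at least n entries. floor(1/u) for u uniform in (0, 1] has that property.
    """
    u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
    return min(max_len, int(1.0 / u))


rng = random.Random(1)
print([linear_length(100, rng) for _ in range(5)])
print([inverse_n_length(100, rng) for _ in range(5)])
```

With the linear rule the average length is about half of `max_len`, while with this reading of the inverse N rule half of all mutations touch only a single entry, so deletions would tend to be much longer than insertions.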

At 6:50 PM, June 14, 2009,  Paul W said...

I am wondering how the new program is going?

At 11:36 AM, June 15, 2009,  Intelligent Designer said...

I haven't yet started on the program. I have been too busy at work and home to make time for it. It should only take me a day or two once I get a block of time for it. So far I only have some data structures and algorithms rolling around in my head that will help the program to get through a lot of generations of mutations fast.

At 9:04 AM, August 13, 2009,  Paul W said...

It has been a couple of months since I last heard from you and I am still interested in how the program is going?

At 9:28 AM, August 13, 2009,  Intelligent Designer said...

I am about halfway there. The simulation portion is done but the presentation isn't. My schedule hasn't allowed me to work on it for about a month. I need one free weekend to hammer it out.

At 7:17 PM, September 05, 2009,  Intelligent Designer said...

I have an initial release. There are still some bugs to work out but you can look at it here.