Abstract: We propose using random walks in software space as abstract formal models of biological evolution. The goal is to shed light on biological creativity using toy models of evolution that are simple enough to prove theorems about them. We consider two models: a single mutating piece of software, and a population of mutating software. The fitness function is taken from a well-known problem in computability theory that requires an unlimited amount of creativity, the Busy Beaver problem. (Talk given Friday October 10, 2008 at the IBM Watson Research Center in Yorktown Heights, NY. The author wishes to thank his colleagues Charles Bennett and David DiVincenzo for their helpful comments.)
I have been interested in theoretical biology for a long time [1]. However I could not find a way to go forward. Here we propose an approach that I feel shows promise, but much remains to be done. We only present some preliminary results.
The immediate stimulus for this paper was Berlinski's entertaining polemic [2], in which he presents a number of criticisms of Darwinian evolution, a number of perplexing aspects of evolution that he feels are not yet well explained by Darwin's theory.
Here are five areas in which I believe that thinking of DNA as mutating software can be helpful. (Software mutations may be point mutations, such as changing or adding a single bit, or they may be high-level, such as copying a subroutine and then modifying it.)
On the other hand, if we are mutating software written in the extremely compact kind of representations used in algorithmic information theory [3], then there is no need for DNA and the genotype can remain the same as the phenotype.
Now I've found a way to do that. In this paper we describe some simple models where you can show that complexity increases due to increased function, not due to software bloat (point (4) above).
For example, 100100 is certainly a big number, but we can do much better than that.
Following Hodges [5], write n↑m instead of nm. Then define n↑↑m as follows:
Going beyond Hodges, define n⇑m to be n↑...↑n with m up arrows! What is the value of 999⇑999?
Unlimited creativity is required for this problem, because for any scheme for naming large numbers, one can come up with a better scheme, just like we improved on Hodges.
(See also Steinhaus [6], Knuth [7], Davis [8].)
Naming big numbers is closely related to a classical problem in computability theory, the Busy Beaver problem, and the Busy Beaver function. If you are naming numbers informally as we have just done, then a Busy Beaver function of N can be loosely defined as the biggest positive integer you can name in ≤ N symbols or characters of text.
If you are using a programming language to name positive integers, then the Busy Beaver function is the greatest positive integer that can be produced by a program of size ≤ N bits (programs that calculate a single integer and then halt). Busy Beaver functions grow faster than any computable function of N, which shows that naming big numbers requires an unlimited amount of creativity.
For more on this, see Aaronson [9] and Wikipedia [10].
We have to pick a universal Turing machine = a general-purpose programming language. I'll use the universal Turing machine that I'm most familiar with, the one U in [3], which has the property that for any other Turing machine C, there is always a prefix πC such that the output U(πC p) produced when the concatenation of πC with p is run on U, is the same as the output C(p) produced when p is run on C. This shows that the programs for U are concise, or, more precisely, not much bigger than the programs for C.
However, we need to modify the universal machine U of [3] so that it runs forever without producing any output or else produces a single positive integer 1, 2, 3, as output. This is easy to do by filtering the output of U. In addition, U requires programs to be self-delimiting, and not all bit strings are valid programs. Here we need all finite bit strings to be valid programs. This can be done by modifying U so that it ignores extra program bits and loops forever if it runs out of program bits.
Since U is universal, our software space includes all possible algorithms for calculating a positive integer.
The most straight-forward way of defining a metric on software space is as the number of point mutations required to get from one organism to another.
That's the general idea, but sometimes we need a more subtle way to define mutation distance, as −log2 of the probability of getting from one organism to another via a single mutation. That is the right way to think about mutation distance if you can go from any organism to another in one mutation with small but non-zero probability, which we need to do in model 1 to avoid getting stuck on a local fitness maximum.
The problem here is to get the random walk to cover the entire software space, i.e., be ergodic. If we pick a point mutation at random this will not happen. There is also the problem of being stuck in a local fitness maximum.
To avoid these problems, we can't just use point mutations, we need to do something more sophisticated. We need a small but non-zero probability of going from any software organism to any other.
One way to get this to work, which just happens to be the first way that I could think of, but which is no doubt only one of many possible ways to accomplish this, is as follows:
If the rules of the game are set up in this way, a single mutation consisting of many point mutations will eventually insert a prefix at the beginning of the software organism that computes an extremely large positive integer and that ignores the rest of the program string. The result is that one can show that with high probability the fitness of our software organism will grow faster than any computable function of the step N, which shows that genuine creativity is occurring.
I omit the detailed calculations and estimates.
Furthermore, fit organisms have many siblings. When an organism is added to the population, we check its fitness. If this is K, we add K additional copies of that organism to our population.
Programs that produce extremely large numbers K will quickly predominate. In fact, at stage N the organism with the most siblings will be the ≤ N-bit program that calculates the biggest number K. This value of K is by definition the Busy Beaver function of N bits, the largest output that can be produced by a program ≤ N bits in size, and grows faster than any computable function of N. This value of K is also the greatest positive integer with program-size complexity (as defined in [3]) ≤ N.
Model 2 is simpler than model 1, and it evolves more quickly, because it is evolving in parallel. Note that this is a deterministic model. However, the history leading from the initial organism to each individual organism at stage N, will still look like a random walk.
At stage N look at all programs ≤ N bits in size and select the one that produces the biggest output, which is in fact the Busy Beaver function of N bits, the largest output that can be produced by a program ≤ N bits in size.
This is even simpler than our previous model, and does just as well. Why did we take the trouble to formulate models 1 and 2, if model 3 is simpler and does just as well?
The answer is that the problem with model 3 is that it is not at all biological in spirit. This is not how Nature searches through the space of all possible organisms.
I should also emphasize that at best our models are the Platonic ideal that biological evolution attempts to approach, they are not the real thing we see happening around us. Nature does not have an oracle for the halting problem, and the number of organisms cannot increase exponentially indefinitely.
Furthermore, different models may be required to shed light on each of the problems (1) through (5) discussed in Section 1. In this paper, we have only addressed point (4), why does complexity increase? We get complexity to provably increase by choosing a fitness function that rewards creativity and by making sure that our random walks in software space are ergodic, i.e., cover the entire space.