Then, as is pointed out in Turing's original paper (1936), and as was emphasized by Post in his American Mathematical Society Bulletin paper (1944), the set X of all theorems, consequences of the axioms, can be systematically generated by running through all possible proofs in size order and mechanically checking which ones are valid. This unending computation, which would be monumentally time-consuming, is sometimes jocularly referred to as the British Museum algorithm.
The size in bits H(X) of the program that generates the set X of theorems---that's the program-size complexity of X---will play a crucial role below. Roughly speaking, it's the number of bits of axioms in the formal theory that we are considering. H(X) will give us a way to measure the algorithmic complexity or the algorithmic information content of a formal axiomatic theory.
But first let's retrace history, starting with a beautiful antique, Gödel's incompleteness theorem, the very first incompleteness theorem.
Therefore formal axiomatic theories, if they only prove true theorems, are incomplete, because they do not prove all true statements. And so true and provable turn out to be rather different!
How does Gödel's proof work? Gödel cleverly constructs an arithmetical or number-theoretic assertion that refers to itself and its unprovability indirectly, via the (Gödel) numbers of the statements and proofs within the formal theory. In other words, he numbers all the statements and proofs within the formal axiomatic theory, and he can then construct a (very complicated) bona fide mathematical assertion that states that it itself is unprovable. The self-reference is indirect, since Gödel's self-referential statement cannot contain its own Gödel number.
Wonderful as it is, Gödel's proof does not involve three of the big ideas of the 20th century, algorithm, information and randomness. The first step in that direction was taken by Turing only five years later.
But before discussing Turing's work, what is an algorithm?
In LISP, which is my favorite programming language, applying the function f to the operands x and y, f(x,y), is written (f x y).
Programs and data in LISP are all S-expressions, which are lists with sublists, enclosed in parentheses and with successive elements separated by blanks. For example, (A BC (123 DD)) is an S-expression. The S in S-expression stands for "symbolic."
For example, let's define set membership in LISP.
(define (in-set? member set) (if (= () set) false (if (= member (head set)) true (in-set? member (tail set)) )) )This defines (in-set? member set) to be false if the set is empty, true if the member is the first element in the set, and to recursively be (in-set? member [the rest of the set]) otherwise.
Let me explain some of this: (if x y z) yields/picks y or z depending on whether or not x is true. And (= x y) checks if x = y, yielding true or false.
Unfortunately, for historical reasons, head is actually written car and tail is actually written cdr.
Then (in-set? (' y) (' (x y z))) yields true, and (in-set? (' q) (' (x y z))) yields false.
Here ' or quote stops evaluation and contains unevaluated data.
In summary, a LISP program isn't a list of statements that you execute or run, it's an expression to be evaluated, and what it does, is it yields a value. In other words, LISP is a functional programming language, not an imperative programming language.
Imagine a numbered list of all possible computer programs for computing real numbers. That is, a list of all possible computer programs in some fixed language, ordered by size, and within programs of the same size, in some arbitrary alphabetical order. So R(N) is the real number (if any) that is computed by the Nth program in the list (N = 1, 2, 3, ...).
Let R(N,M) be the Mth digit after the decimal point of the Nth computable real, that is, the real R(N) calculated by the Nth program. Define a new real R* whose Nth digit after the decimal point, R*(N), is 3 if R(N,N) is not equal to 3, and otherwise is 2 (including the case that the Nth computer program never outputs an Nth digit). Then R* is an uncomputable real, because it differs from the Nth computable real in the Nth digit. Therefore there cannot be any way to decide if the Nth computer program ever outputs an Nth digit, or we could actually compute R*, which is impossible.
Corollary: No formal axiomatic theory can always enable you to prove whether or not the Nth computer program ever outputs an Nth digit, because otherwise you could run through all possible proofs in size order and compute R*, which is impossible.
Note: Whether the Nth computer program ever outputs an Nth digit is a special case of the halting problem, which is the problem of determining whether or not a computer program ever halts (with no time limit). If a program does halt, you can eventually determine that, merely by running it. The real problem is to decide that a program will never halt, no matter how much time you give it.
What is the "self-delimiting binary universal computer" U that we use below?
U's program is a finite bit string, and starts off with a prefix that is a LISP expression. The LISP expression prefix is converted into binary, yielding 8 bits per character, and it's followed by a special punctuation character, 8 more bits, and that is followed by raw binary data. The LISP prefix is evaluated or run and it will read the raw binary data in one bit at a time without ever being allowed to run off the end of the program.
In other words, the LISP prefix must ask for exactly the correct number of bits of raw binary data. If it requests another bit after reading the last one, this does not return a graceful "end of file" condition, it aborts the computation.
The fact that there is no punctuation marking the end of the raw binary data, which is also the end of the entire program, is what forces these machine-language programs to be self-delimiting. In other words, the end of the binary program is like a cliff, and the computer U must not fall off!
What is the halting probability Ω for U?
This halting probability can be defined in such a manner that it includes binary programs p of every possible size precisely because these p must be self-delimiting. That is to say, precisely because U must decide by itself where to stop reading the program p. U must not overshoot, it cannot fall off the cliff, it must read precisely up to the last bit of p, but not beyond it.
Knowing the first N bits of the base-two representation of the real number Ω, which is a probability and therefore between zero and one, answers the halting problem for all programs up to N bits in size. So if you knew the first N bits of Ω after the decimal or the binary point, that would theoretically enable you to decide whether or not each binary computer program p up to N bits in size halts when run on U.
Would this be useful? Yes, indeed! Let me give an example showing just how very useful it would be.
Let's consider the Riemann hypothesis, a famous mathematical conjecture that is still open, still unresolved. There is a Riemann-hypothesis testing program that systematically searches for counter-examples and that halts if and only if it finds one and the Riemann hypothesis is false. The size in bits of this program is the program-size complexity of testing the Riemann hypothesis this way. And if this program is H(Riemann) bits long, knowing that many initial bits of Ω would settle the Riemann hypothesis! It would enable you to tell whether or not the Riemann hypothesis is true.
Unfortunately the method required to do this, while theoretically sound, is totally impractical. The computation required, while finite, is much, much too long to actually carry out. The time needed grows much, much faster than exponentially in H(Riemann), the number of bits of Ω that we are given. In fact, it grows as the time required to simultaneously run all programs on U up to H(Riemann) bits in size until all the programs that will ever halt have done so. More precisely, you have to run enough programs for enough time to get the first H(Riemann) bits of Ω right, which because of carries from bits that are further out may actually involve programs that are more than H(Riemann) bits long.
And precisely because the bits of Ω are so useful, it turns out that they are irreducible mathematical information, that they cannot be derived or deduced from any simpler principles. More precisely, we have the following incompleteness result: You need an N-bit formal axiomatic theory (that is, one that has an N-bit algorithm to generate all the theorems) in order to be able to determine the first N bits of Ω, or, indeed, the values and positions of any N bits of Ω.
Actually, what I show in Chaitin (1998) is that an N-bit theory can't determine more than N + c bits of Ω, where the constant c is 15328.
Let's restate this. Consider a formal axiomatic theory with the set of theorems X and with algorithmic complexity or algorithmic information content H(X). Then if a statement such as
We can also describe this irreducibility non-technically, but very forcefully, as follows: Whether each bit of Ω is a 0 or a 1 is a mathematical fact that is true for no reason, it's true by accident!
Ω is actually just one piece of my algorithmic information theory (AIT), it's the jewel that I discovered while I was developing AIT. Let me now give some highlights of AIT.
What else can we do using the computer U that we used to define Ω? Well, you should look at the size of programs for U. U is the yardstick you use to measure algorithmic information. And the unit of algorithmic information is the 0/1 bit.
You define the absolute algorithmic information content H(X) of an object (actually, of a LISP S-expression) X to be the size in bits of the smallest program for U to compute X. The joint information content H(X,Y) is defined to be the size in bits of the smallest program for U to compute the pair X, Y. (Note that the pair X, Y is actually (X Y) in LISP.) The relative information content H(X|Y) is defined to be the size in bits of the smallest program for U that computes X from a minimum-size program for Y, not from Y directly.
And the complexity H(X) of a formal axiomatic theory with theorem set X is also defined using the computer U. (I glossed over this point before.) H(X) is defined to be the size in bits of the smallest program that makes U generate the set of theorems X. Note that this is an endless computation. You may think of H(X) as the number of bits of information in the most concise or the most elegant set of axioms that yields the set of theorems X.
Now here are some of the theorems that you can prove about these concepts.
First of all, let's consider an N-bit string X. H(X) is usually close to N + H(N), which is approximately N + log_{2}N. Bit strings X for which this is the case are said to be algorithmically random, they have the highest possible information content.
On the other hand, an infinite sequence of bits X is defined to be algorithmically random if and only if there is a constant c such that H(the first N bits of X) is greater than N − c for all N. And, crucial point, the base-two representation for Ω satisfies this definition of algorithmic randomness, which is one of the reasons that Ω is so interesting.
Algorithmic information is (sub)additive:
Finally, here are some subtle results that relate mutual and relative information:
These results are quoted here in order to show that Ω isn't isolated, it's part of an elegant theory of algorithmic information and randomness, a theory of the program-size complexity for U.
Now let me tell you what I think is the significance of these incompleteness results and of Ω.
I think that incompleteness cannot be dismissed and that mathematicians should occasionally be willing to add new axioms that are justified by experience, experimentally, pragmatically, but are not at all self-evident. [In my opinion that P is not equal to NP is a good example of such a new axiom.] Sometimes to prove more, you need to assume more, to add new axioms! That's what my information-theoretic approach to incompleteness suggests to me.
Of course, at this point, at the juncture of the 20th and the 21st centuries, this is highly controversial. It goes against the current paradigm of what mathematics is and how mathematics should be done, it goes against the current paradigm of the nature of the mathematical enterprise. But my hope is that the 21st century will eventually decide that adding new axioms is not at all controversial, that it's obviously the right thing to do! However this radical paradigm shift may take many years of discussion and thought to be accepted, if indeed this ever occurs.
For further discussion of this quasi-empirical, experimental mathematics viewpoint, see Chaitin (1998, 1999, 2002), Tymoczko (1998), Bailey (2002).
For superb histories of many aspects of 20th century thought regarding the foundations of mathematics that we have not touched upon here, see Grattan-Guinness (2000), Tasic (2001).
In my opinion the following line of research is relevant and should not have been abandoned:
Will mathematics become more like physics, more experimental, more quasi-empirical, with fewer proofs?
For a longer discussion of this, see the chapter on mathematics in the third millennium in Chaitin (1999).
To a mathematician a UTM is a formal, artificial, unambiguous language for formulating algorithms, a language that can be interpreted mechanically, and which enables you to specify any algorithm, all possible algorithms. That's why it's called "universal."
What is a UTM to a physicist? Well, it's a physical system whose repertoire of potential behavior is extremely rich, in fact maximally rich, universal, because it can carry out any computation and it can simulate the behavior of any other physical system.
These are two sides of a single coin.
And, in a sense, all of this was anticipated by Turing in 1936 when he used the word machine. Also the "halting problem" almost sounds like a problem in physics. It sounds very down-to-earth and concrete. It creates a mental image that is more physical than mathematical, it sounds like you are trying to stop a runaway locomotive!
After all, what you can compute depends on the laws of physics. A different universe might have different computers. So, in a way, Turing's 1936 paper was a physics paper!
It is sometimes useful to think of physical systems as performing algorithms, and of the entire universe as a single giant computer. Edward Fredkin was one of the earliest proponents of this view. See Wright (1988).
There is an increasing amount of work by physicists that suggests that it is fertile to view physical systems as information-processing systems, and that studies how physical systems process information. The extremely popular field of research of quantum computation and quantum information [Nielsen (2000)] certainly follows this paradigm.
And there are also suggestions from black hole thermodynamics and quantum mechanics that the physical universe may actually be discrete, not continuous, and that the maximum amount of information contained in a physical system is actually finite. A leading researcher in this area is Jacob Bekenstein, and for more on this topic, see the chapter on the holographic principle in Smolin (2001).
Wolfram (2002) is a treasure-trove of simple combinatorial (symbolic, discrete, non-numerical) algorithms with extremely rich behavior (in fact, universal behavior, equivalent to a UTM, a universal Turing machine, in other words, that can perform an arbitrary computation and simulate an arbitrary physical system). These are superb building blocks that God might well have used in building the universe! Here I refer to high-energy or particle physics. Time will tell---we will see! Wolfram also supports, and himself employs, an experimental, quasi-empirical approach to doing mathematics.
Let me also cite the physicist who might be considered the inventor of quantum computing, and also cite a journalist. See the chapter on "Universality and the limits of computation" in Deutsch (1998), and Siegfried (2000) on the physics of information and the seminal ideas of Rolf Landauer.
I should mention that my own work on AIT in a sense belongs to this school; it can be viewed as an application of the physical notion of entropy (or disorder) to metamathematics. In other words, my work on Ω in a sense treats formal axiomatic theories as if they were heat engines. That is how I show that Ω is irreducible, using general, "thermodynamical" arguments on the limitations of formal axiomatic theories.
Another example of ideas from physics that are invading computer science are the phase changes that mark a sharp transition from a regime in which an algorithm is fast to a regime in which the algorithm is extremely slow, for instance the situation described in Hayes (2002).
Ergodic theory says that things get washed out and less interesting as time passes. The theory I want would show that interesting things (life! organisms! us!) emerge and evolve spontaneously, and that things get more and more interesting, not less and less interesting.
My old attempt at a theory [Chaitin (1970, 1979)] considered cellular automata and proposed using mutual algorithmic information to distinguish a living organism from its environment. My idea was that the parts of an organism have high mutual information.
Wolfram (2002) sustains the thesis that life is not unusual. He claims that there is no essential difference between us and any other universal Turing machine. Furthermore, according to Wolfram, most non-trivial physical and combinatorial systems are universal Turing machines, UTMs. (A UTM is a physical system whose behavior is as rich as possible because it is a general-purpose computer that can perform an arbitrary computation, in other words, that can simulate any algorithm and any other physical system.) Therefore, according to Wolfram, there is nothing to Darwin, nothing to understand. The evolution of life is a non-issue. According to Wolfram you get there, you get life, right away, all at once.
Wolfram's thesis, while interesting, is not, I believe, the entire story. Universal Turing machines, computation, may be ubiquitous in nature, but the amount of software a UTM has is not taken into account by Wolfram. And that, I believe, is what is actually evolving! After all, DNA is essentially digital software, and we have much more DNA than viruses and bacteria. Our program-size complexity is higher, H(human) is much greater than H(bacteria).
This suggests to me a new toy model of evolution very different from the cellular automata model that I originally considered. My new idea is to model life, an ecology, as a collection of UTMs, and to study how their software evolves in complexity. The problem is how to model the environment, more precisely, the interactions of organisms with each other and with their environment. Let me emphasize that in such a model the problem is not to distinguish an organism from its environment, which before I attempted to do using mutual information, it is to model interactions, so that the organisms are not like Leibniz's windowless monads! And then of course to prove that the software complexity will evolve...
For more on mutual information, see Chaitin (2001). For further discussion of my hopes for a theoretical biology, and for my thoughts on biology in general, see Chaitin (2002). For von Neumann's seminal work in this area, see von Neumann (1966). Two recent books on biology that particularly impressed me are Maynard Smith (1999), and Kay (2000).
See Nørretranders (1998) for some interesting discussions of precisely these questions.
However, again, I don't just want an interesting discussion, I want a mathematical theory with beautiful theorems and proofs! Human intelligence may just be a very complicated piece of engineering, or there may be some profound, basic, as yet unknown concepts at play, and a fundamental theory about them. And these two possibilities are not mutually exclusive. Time will tell!
University of Auckland, May 2002