In this chapter I'll tell you where my theory goes from there. LISP is only the first step. To make further progress you need to construct a programming language in which to measure the size of programs. I won't give any proofs, but I'll outline the basic ideas. I'll give a survey of what you get, of the subject that I call algorithmic information theory (AIT), which is concerned with program-size complexity, algorithmic information content, and algorithmic incompressibility or randomness. We'll get to my most devastating incompleteness theorems, theorems involving the random number Ω, the halting probability.
The bottom line is that I can show that in some areas of mathematics, mathematical truth is completely random, unstructured, patternless and incomprehensible. In fact, by using the work of Y. Matijasevic and J. Jones on Hilbert's 10th problem, I can even show that this occurs in elementary number theory, in Peano arithmetic. I exhibit an algebraic equation involving only whole numbers (a so-called diophantine equation) in which the number of solutions jumps from finite to infinite completely at random as you vary a parameter in the equation. In fact, this gives us the bits of the random number Ω. So we will never be able to know whether or not my equation has a finite number of solutions in each particular instance. More precisely, these are irreducible mathematical facts. They can only be deduced by adding them as axioms. Settling N cases requires N bits of axioms.
So not only was Hilbert's faith in the axiomatic method wrong, in some cases it was completely wrong. Because to say that some mathematical truths are irreducible means that they cannot be compressed into axioms at all, they cannot be deduced from any principles simpler than they are.
Let me start with a brief outline of the history of my field. Some people have already published their versions of this history. Here I'd like to tell you how it looked from my vantage point. I'll tell how I saw it, how I experienced it.
It was very unfortunate that publication of the second half was delayed by the editor for three years. It was also unfortunate that the referee, Donald Loveland, immediately sent the entire uncut original manuscript to Kolmogorov in Moscow.
An earlier piece of work, involving the time and program-size complexity of infinite sets, was my third paper in the ACM Journal (1969). Then I turned to the size of programs for computing finite binary sequences, i.e., bit strings, and to the randomness or incompressibility of individual bit strings, which led to my first two papers in the ACM Journal. So these papers were not published in chronological order. [An even earlier piece of work led to my first publication, which was not in the ACM Journal. When I was in high school I programmed on the computer all the algorithms in E.F. Moore's paper ``Gedanken-experiments on sequential machines.'' ``Sequential machines'' were finite automata, and Moore's paper was in the very first book on theoretical computer science, C.E. Shannon and J. McCarthy's Automata Studies (Princeton University Press, 1956). This led to my first publication, written while I was in high school: ``An improvement on a theorem of E.F. Moore,'' IEEE Transactions on Electronic Computers EC-14 (1965), pp. 466-467. Moore's paper dealt with a toy model of the problem of scientific induction, namely the problem of identifying an automaton by giving it inputs and looking at the outputs—hence the title gedanken or thought experiments. And this involves a finite automata version of Occam's razor, because it's desirable to find the simplest finite automaton—the finite automaton with the smallest number of states—that explains a series of experiments on a black box. As I said when describing my APL2 physics course, wherever I look, I see program-size complexity! And as my gedanken-experiment project, my APL2 gallery, my Springer book, and this book all illustrate, in my opinion the best way to understand something is to program it out and see if it works on the computer.]
But a young Swede visiting Kolmogorov in Moscow, P. Martin-Löf, realizes that something is wrong, because Kolmogorov's proposal for defining infinite random strings turns out to be vacuous. Kolmogorov had required infinite random strings to have all prefixes be incompressible, but this fails because Martin-Löf notes that long runs of 0's and 1's produce logarithmic complexity dips. (I also noticed this problem, and proposed a different complexity-based definition of infinite random string, a more permissive one. This leads to the opposite problem, namely that it accepts some non-random strings.) So Martin-Löf abandons program-size complexity and proposes a constructive measure-theoretic definition of random infinite string. [A real number is Martin-Löf random iff it is not contained in any constructively covered set of measure zero.]
What do I do? I don't abandon program-size complexity, I stick with it. I change the definition to use self-delimiting binary programs. Then most N-bit strings require N + log_{2}N bit programs. It's now okay to demand that the complexity of each N-bit prefix of an infinite random string should never drop below N. The log_{2}N complexity dips now go from N + log_{2}N to N instead of from N to N − log_{2}N. (I'll explain this better later.) And my complexity-based definition of randomness now works for both finite and infinite strings. It turns out to be equivalent to Martin-Löf's for infinite strings.
I'm invited to speak on this, the second major version of AIT, at the opening plenary session of the 1974 IEEE International Symposium on Information Theory in Notre Dame, Indiana, with several well-known Soviet information-theorists in attendance. I publish this new version of AIT in my 1975 ACM Journal paper ``A theory of program size formally identical to information theory,'' and later, in more complete form, in my 1987 Cambridge University Press monograph Algorithmic Information Theory.
Meanwhile another Russian, L.A. Levin, also realizes that self-delimiting programs are necessary, but he doesn't get it all right, he doesn't do as good a job. For example, he doesn't realize, as I did, that the definition of relative complexity also has to be changed. (I'll explain this later, but the basic idea is that you're not given something for free directly, you're given a minimal-size program for it instead.)
And to my knowledge, no one else realizes that AIT can be reformulated as a theory of the size of real programs in a usable programming language, one based on LISP. But that's not too surprising, because I had to invent the programming language and write all the software for running it. That's the third major reformulation of AIT, and this time I'm the only one who does it. It's presented in my 1998 Springer-Verlag book The Limits of Mathematics.
Anyway, in my opinion AIT really begins with my 1975 ACM Journal paper ``A theory of program size formally identical to information theory;'' the rest was the pre-history of the field!
On the side, just for the fun of it, I also developed three different theories of LISP program-size complexity. These are in the second World Scientific collection of my papers, the ``autobiography'' published in 1992. I did this work because (a) I like LISP, (b) it's nice to look at the size of real programs, and (c) these theories work much like one of my original theories, which measured Turing machine programs in states. My LISP program-size work resurrected one of my first efforts, dealing with what I called ``bounded-transfer'' Turing machines. This is a somewhat peculiar machine model, but I was fond of these youthful ideas, and hated to see them disappear completely into the dustbin of history. I felt that my work on LISP confirmed the validity of some of my youthful intuitions about the right way to develop a theory of the size of real programs; I just hadn't applied these ideas to the right programming language!
I also retain an interest in applying program-size complexity measures to computing infinite sets. This part of the theory is much less developed than the program-size complexity of computing individual finite objects. I have only one paper on this subject, ``Algorithmic entropy of sets.'' It's in my first World Scientific volume, the one published in 1987, and, in a second edition, in 1990. However I do use this to define the complexity of a formal axiomatic system as the size of the smallest program for generating all of its theorems. I think that this is a better definition than the one I used in Chapter V that the complexity of a formal axiomatic system is given by the size of the smallest program for its proof-checking algorithm. But of course they are closely related. Many interesting open questions remain in the part of the theory dealing with infinite computations instead of finite ones.
What I've presented above is a history of AIT proper, not of its application to metamathematics and epistemology. In my first ACM Journal paper I prove that program-size complexity is uncomputable, though I was not the only person to notice this. But I am the only person to realize that AIT sheds dramatic new light on the incompleteness phenomenon discovered by Gödel. This is not at all equivalent to the remark that program-size complexity is uncomputable, which is a weak and abstract observation. Why is this? It's because one can begin to discuss the information content of axioms, and also because in a sense a formal axiomatic system amounts to a computation in the limit of infinite time, so that saying that something cannot be proven (with proofs of any size) is stronger than saying that it cannot be computed. (Technically, what I'm saying amounts to the observation that a formal axiomatic system is ``recursively enumerable,'' not ``recursive.'') [I believe that the current terminology is ``computably enumerable'' and ``computable.'' At any rate, the meaning is this. The set of theorems of a formal axiomatic system has the property that there's an algorithm for generating its elements (in some arbitrary order). But in general there's no algorithm to decide if something is in the set of theorems or not. (That's the Entscheidungsproblem, the decision problem, in the title of Turing's 1936 paper, where he proved that these two ways of defining a set are different.)]
I realize this at age 22 during a visit to a university in Rio de Janeiro in 1970. It's just before Carnival in Rio and I recall learning there the sad news that Bertrand Russell, one of my heroes, had died. I show that a formal axiomatic system cannot establish any lower bounds on the program-size complexity of individual objects, not if the lower bound is substantially larger than the complexity of the axioms themselves. This marks the beginning of my information-theoretic approach to incompleteness.
When Jacob Schwartz of the Courant Institute visits Buenos Aires soon after, he is astonished to hear my ideas and encourages me to develop them. [I lived from 1966 to 1975 in Buenos Aires—where I joined IBM in 1967—and the rest of the time in New York.] I later discover that he had been in Moscow discussing AIT. (I realize this when I see an acknowledgement of his participation in a survey paper on AIT by A.K. Zvonkin and Levin in Russian Mathematical Surveys.) Schwartz's astonishment shows clearly that the essential point had not been grasped by the Moscow school of AIT.
I publish this idea in Rio in a research report in 1970, in an abstract in the AMS Notices in 1970, in a short paper in the ACM SIGACT News in 1971, in an invited paper in the IEEE Information Theory Transactions in 1974, in a long paper in the ACM Journal in 1974—my fourth ACM Journal paper—and in an article in Scientific American in 1975.
I send the galley proofs of my invited paper in the IEEE Information Theory Transactions to Gödel in early 1974 after a phone conversation with him requesting an interview. He reads my paper and in a second conversation grants me an appointment, but this never comes to pass due to bad weather and the fact that my visit to the U.S.A. is coming to an end.
My second major period of metamathematical activity is due to an invitation in 1986 from Cambridge University Press to write the first book in their series on theoretical computer science. In their letter of invitation they explain that I was picked to be first in order to make the point that computer science has deep intellectual significance and is not just software engineering.
It is then that I realize that I can dress up my random Ω number, the halting probability, as a diophantine equation, and that there is therefore randomness in arithmetic, in elementary number theory. And I am also able to show that an N-bit formal axiomatic system can determine at most N bits of Ω, even if the bits are scattered about instead of all at the beginning.
My book about this, Algorithmic Information Theory, is published by Cambridge University Press in 1987 and causes a commotion. In 1988 Ian Stewart praises it in a news item in Nature entitled ``The ultimate in undecidability.'' Later in 1988 I'm surprised to find an article with my photograph entitled ``Une extension spectaculaire du théorème de Gödel: l'équation de Chaitin'' (A spectacular extension of Gödel's theorem: Chaitin's equation) by Jean-Paul Delahaye in La Recherche. I'm asked to write about this in Scientific American, in La Recherche, and in New Scientist.
Two of the high points of my career follow. In 1991 John Casti and Hans-Christian Reichel invite me to talk about my work in Gödel's old classroom in Vienna. John announces my visit in a full-page article with my photo entitled ``Gödeliger als Gödel'' (Out-Gödeling Gödel) in the Vienna newspaper Der Standard. And in 1992 I visit Cambridge University, where Russell and Turing worked. The occasion is a high-level meeting on reductionism, and my talk is recorded as my paper on ``Randomness in arithmetic and the decline and fall of reductionism in pure mathematics'' in the book J. Cornwell, Nature's Imagination, Oxford University Press, 1995. This paper, perhaps my most popular, is reprinted several times.
My third major period of metamathematical activity is started by an invitation from George Markowsky to give a course at the University of Maine in Orono in 1994. I realize how to program my theory on the computer, and I include in the course a much simpler proof of my result about determining bits of Ω. [I'm talking about my proof that an N-bit formal axiomatic system can determine at most N bits of Ω, even if the bits are scattered about instead of all at the beginning. In my course I use the simple Berry-paradox program-size proof in my ``Information-theoretic incompleteness'' paper in Applied Mathematics & Computation (1992), instead of the original complicated measure-theoretic proof in my Cambridge University Press monograph.] I refine the course greatly as the result of a second invitation. This time I'm invited by Veikko Keränen to give a course in Rovaniemi, Finland, in May 1996. It's an amazing experience in every possible way. It never gets dark, and Veikko and I drive to Norway's North Cape, the top of Europe. The final result is my 1998 book The Limits of Mathematics, actually published at the end of 1997, which sees the light only because of the enthusiastic support of my good friend Cris Calude.
Again the results exceed all expectations. The Limits of Mathematics is announced by Springer-Verlag, the world's leading math publisher, with these words: ``Capture a piece of mathematics history-in-the-making with Gregory Chaitin's New Book The Limits of Mathematics.'' The ``energetic lectures'' and ``exuberant style'' in this book, noted by the Library of Science book club, reflect both my astonishment at being able to present my strongest metamathematical results so simply and the encouragement that I received from George and Veikko and the exhilarating experience of giving my course twice to interested and able audiences in beautiful environments.
Two delightful consequences are that in 1998 an interview with me is the lead article on the cover of the Sunday magazine of a Buenos Aires newspaper Página/12, and I'm also interviewed in the Sunday magazine of the Lisbon, Portugal newspaper Expresso. These magazine interviews include photographs of me, my home, and my Springer book, an amazing experience for a mathematician whose main interest is epistemology!
It's been a wonderful life. I never imagined as a child that things could go this way, or that it could pass so quickly...
My biggest disappointment is that I'm unable to use program-size complexity to make mathematics out of Darwin, to prove that life must evolve, because it's very hard to make my kind of complexity increase. But Wolfram uses the ubiquity of universality to argue that there is nothing to explain, and perhaps he's right... I'll describe Wolfram's ideas in the concluding chapter.
This personal story is designed to humanize what would otherwise be a dry piece of mathematics, and to show what an adventure discovery can be, to show the blood, sweat and tears... But now let me quickly outline the mathematics...
Then we define the complexity or algorithmic information content H(X) of a LISP S-expression X to be the size in bits |p| of the smallest program p that produces X.
The result of this is that most N-bit strings require programs very close to N bits long. These are the random or incompressible N-bit strings.
Unfortunately this theory still has some serious problems. One symptom of this is that complexity is not additive. It is not the case that the complexity of a pair is bounded by the sum of the individual complexities. I.e., it's not the case that

H((X Y)) ≤ H(X) + H(Y) + c.
In other words, you can't combine subroutines because you can't tell where one ends and the other begins. To solve this, let's make programs ``self-delimiting''. [This (sub)additivity property played a big role in my thinking: (a) because for a programmer it's a very natural requirement and (b) because it was valid in my two earlier theories of Turing machine program-size complexity measured in states and (c) because in fact it had played an absolutely fundamental role in my theory of program-size for ``bounded-transfer'' Turing machines. I had regretfully given up additivity in going from my Turing machine theories to binary programs. But I badly wanted additivity back! That's why I came up with self-delimiting binary programs. I had already been working with self-delimiting programs before... By the way, remember that I showed that LISP complexity is additive in Chapter V? That's another reason that I love LISP!]
The fact that ``read-bit'' does not return an end-of-file condition but instead kills the computation is absolutely crucial. This forces the program p to indicate its own size within itself somehow, for example, by using a scheme like the length header that's placed at the beginning of variable-length records. Now most N-bit strings X have complexity greater than N, because programs not only have to indicate the content of each bit in X, they also have to indicate how many bits there are in order to make the program self-delimiting.
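For concreteness, here is a toy self-delimiting code in Python. The header scheme is made up for this illustration (it is not the encoding used in AIT): each bit of the length field is doubled, and the unmatched pair ``01'' marks the end of the header, so a decoder reading one bit at a time always knows where the program ends.

```python
# Toy self-delimiting code: double each bit of the binary length
# (0 -> 00, 1 -> 11), then the marker "01", then the payload itself.

def encode(payload: str) -> str:
    header = "".join(b + b for b in format(len(payload), "b"))
    return header + "01" + payload

def decode(stream: str) -> tuple[str, str]:
    """Read one self-delimiting block; return (payload, rest of stream)."""
    i = 0
    length_bits = ""
    while stream[i:i + 2] != "01":   # header pairs are only "00" or "11"
        length_bits += stream[i]
        i += 2
    i += 2                           # skip the "01" marker
    n = int(length_bits, 2)
    return stream[i:i + n], stream[i + n:]

# Two encoded programs can be concatenated and still be parsed apart:
s = encode("1101") + encode("00")
first, rest = decode(s)
assert first == "1101"
assert decode(rest)[0] == "00"
```

Because each block carries its own length, subroutines can be concatenated and told apart, which is exactly what the additivity property needs. The price is the header: a program for an N-bit string costs roughly N plus the bits needed to state N.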
The final result is that most N-bit strings X now have complexity H(X) very close to N + H(N). That's N plus the size in bits of the smallest program to calculate N, which is usually about N + log_{2}N. So roughly speaking

H(X) ≈ N + H(N) ≈ N + log_{2}N.
And now information is additive: H((X Y)) ≤ H(X) + H(Y) + a constant number of bits c required to stitch the two subroutines for X and Y together.
Solomonoff had considered all the programs that produce a given output, not just the smallest ones, but he had not been able to get it to work. The sums over all programs that he defined diverged, they always gave infinity.
Well, with self-delimiting programs everything works like a dream. In addition to the complexity H(X) of a LISP S-expression X, which is the size of the smallest program for X, we can define a complexity measure that includes all programs for X. That's the probability that a program produced by coin tossing produces X.
The probability that a program produced by coin tossing produces X turns out to be

P(X) = Σ_{U(p) = X} 2^{−|p|}.

I.e., each k-bit program p that produces X adds 2^{−k} to the probability P(X) of producing X.
I'm proud of my theorem in my 1975 ACM Journal paper that the complexity H and the probability P are closely related. In fact, the difference between H(X) and −log_{2} P(X) is bounded.
In other words, most of the probability of computing X is concentrated on the elegant programs for calculating X. And this shows that the elegant program is essentially unique, i.e., that Occam's razor picks out a bounded number of possibilities. And this connection between program size and probability unlocks the door to other deep results. I use it to prove a beautiful decomposition theorem.
This states that the difference between (the complexity of a pair X, Y) and (the sum of the complexity of X plus the relative complexity of Y given X) is bounded. What's the relative complexity of Y given X? It's the size of the smallest program to calculate Y if we are given an elegant program to calculate X for free.
An important corollary concerns the mutual complexity or information content H(X:Y). That's defined to be the extent to which the complexity of a pair is less than the sum of the individual complexities.
Here's my result:

H(X:Y) = H(X) − H(X|Y) + O(1)
       = H(Y) − H(Y|X) + O(1)
In other words, I show that within a bounded difference the mutual complexity or the mutual information content is also the extent to which knowing Y helps us to know X and the extent to which knowing X helps us to know Y.
Finally, there is the important concept of (algorithmic) independence. Two objects are independent if the complexity of the pair is equal to the sum of the individual complexities:

H((X Y)) ≈ H(X) + H(Y).
More precisely, they are independent if their mutual complexity is small compared to their individual complexities. For example, two N-bit strings are algorithmically independent if their mutual complexity is H(N), i.e., if the only thing that they have in common is their size. Where can we find such a pair of strings? That's easy, just take the two halves of a random (maximum complexity) 2N-bit string!
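As a playful empirical analogue, one can use a real compressor's output size as a crude stand-in for H (an analogy only; zlib is far weaker than program-size complexity). Two independent random halves share almost nothing, so compressing the pair saves almost nothing over compressing the halves separately, while a string paired with itself shares everything.

```python
# zlib compressed size as a rough stand-in for the complexity H.
import os
import zlib

def c(data: bytes) -> int:
    return len(zlib.compress(data, 9))

x = os.urandom(4096)   # 4096 random bytes
y = os.urandom(4096)   # independent of x

# Empirical "mutual information": how much compressing the pair
# beats the sum of compressing the parts separately.
saving_independent = c(x) + c(y) - c(x + y)   # near zero
saving_self = c(x) + c(x) - c(x + x)          # large: x determines x

print(saving_independent, saving_self)
```

The two halves of a random 2N-bit string behave like x and y here: knowing one tells you essentially nothing about the other.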
Now what's a finite random string? Well, the most random N-bit strings X have H(X) close to N + H(N). As I said before, most N-bit strings have close to this maximum possible complexity, i.e., are highly random. And as the complexity of an N-bit string drops below this, the string gets less and less random, and there are fewer and fewer such strings. [More precisely, the number of N-bit strings X such that H(X) < N + H(N) − K is less than 2^{N−K+c}.] But where should we draw the line? How low should we let H(X) drop before we reject X as random? Well, as I said before, it's a matter of degree. But if you insist that I should provide a sharp cutoff, I can do it. How? The answer is provided by looking at infinite bit strings, in other words, at real numbers in binary notation, with an infinite number of binary digits of precision.
C.P. Schnorr showed that an infinite bit string X satisfies all computable statistical tests of randomness (which is Martin-Löf's randomness definition) iff there is a constant c such that

H(X_{N}) > N − c
where X_{N} is the first N bits of X (which is my randomness definition). In fact, I show that if this is the case then H(X_{N}) − N must go to infinity. [In my Cambridge book, I prove that four randomness definitions for infinite bit strings are equivalent, including one by R.M. Solovay.]
So let's draw the cutoff for finite randomness when the complexity of an N-bit string drops below N. Then we can define an infinite random string X to be one with the property that almost all (all but finitely many) of its prefixes X_{N} are finite random strings.
Some examples will clear the air. First, what's the least complex N-bit string? Well, obviously the string of N 0's. Its complexity is within a fixed number of bits of H(N). In other words, to calculate it we only need to know how many bits there are, not what they are.
Second, consider an N-bit elegant program. It turns out that its complexity is very close to N, within a fixed number of bits. So elegant programs are right on the borderline between structure and randomness. They have just enough structure to be self-delimiting!
Third, consider the N-bit base-two numeral for the number of N-bit strings which have exactly the maximum possible complexity. I showed in a 1993 note in Applied Mathematics & Computation that this number is itself an N-bit string within a fixed number of bits of the maximum possible complexity, which is N + H(N).
So these are three milestones on the complexity scale from least to most random.
Define the halting probability Ω for my computer U as follows:

Ω = Σ_{U(p) halts} 2^{−|p|}.
Since Ω is a probability, we have

0 < Ω < 1.
Now let's write Ω in binary, i.e., in base-two notation, like this:

Ω = .b_{1}b_{2}b_{3}...
whatever it is. Knowing the first N bits of this real number Ω would enable us to solve the halting problem for all programs for U up to N bits in size. Using this fact, I show that Ω is an algorithmically incompressible real number. I.e., there is a constant c such that

H(Ω_{N}) > N − c
where Ω_{N} is the first N bits of Ω. It follows that the bits of Ω satisfy all computable statistical tests for randomness. Separately, I show that the bits of Ω are irreducible mathematical facts: it takes N bits of axioms to be able to determine N bits of Ω. More precisely, there is a constant c' such that it takes N + c' bits of axioms to be able to determine N bits of Ω.
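As a toy illustration of the definition, take a hypothetical miniature machine whose halting programs are simply listed outright. (This is emphatically not the actual U, whose halting behavior is uncomputable; the three programs below are made up.)

```python
# Toy halting probability for a made-up machine whose self-delimiting
# halting programs we simply list.
from fractions import Fraction

halting_programs = ["0", "10", "110"]   # hypothetical, prefix-free

# Self-delimiting programs form a prefix-free set: no program extends
# another. The Kraft inequality then keeps the sum below 1.
assert all(not q.startswith(p)
           for p in halting_programs
           for q in halting_programs if p != q)

# Each k-bit halting program contributes 2^-k, exactly as in P(X).
omega = sum(Fraction(1, 2 ** len(p)) for p in halting_programs)
print(omega, float(omega))   # 7/8 0.875
```

For the real U no such list can be written down, which is precisely why the bits of Ω are irreducible.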
In my 1998 Springer-Verlag book, I actually determine these constants c and c': c = 8000 and c' = 15328!
My equation, with a parameter K and 20,000 unknowns, has finitely or infinitely many natural number solutions (each solution is a 20,000-tuple with the values for X_{1}, ..., X_{20000}) if the Kth bit of Ω is, respectively, a 0 or a 1. Therefore determining whether this equation has finitely or infinitely many solutions is just as difficult as determining bits of Ω. [Actually, Hilbert in 1900 had asked a slightly different question. He had asked for a method to determine if an arbitrary diophantine equation has a solution or not. I'm interested in whether the number of solutions is finite. No solution is a finite number of solutions... Matijasevic showed in 1970 that Hilbert's 10th problem is equivalent to the halting problem. But Hilbert's 10th problem and the halting problem do not give randomness. They're not independent, irreducible mathematical facts. Why not? Because in order to solve N instances of either problem we just need to know how many of the N equations have a solution or how many of the N programs halt. And that's much less than N bits of information!]
I explain the detailed construction of this equation in my 1987 Cambridge University Press monograph. The final version of my software for constructing this equation is in Mathematica and C and is available from the Los Alamos e-print archives at http://xxx.lanl.gov as report chao-dyn/9312006.
Phew! That's a lot of math we've just raced our way through! The intent was to show you that program-size complexity is a serious business, that AIT is a serious, ``elegant'' (in the non-technical sense) well-developed field of mathematics, and that if I say that Ω is random, irreducible mathematical information, I know what I'm talking about.
Now it's time to wrap things up!