Recent discoveries have unified the fields of computer science and information theory into the field of algorithmic information theory. This field is also known by its main result, Kolmogorov complexity. Kolmogorov complexity gives us a new way to grasp the mathematics of information, which is used to describe the structures of the world. Information is used to describe the cultural structures of science, legal and market institutions, art, music, knowledge, and beliefs. Information is also used in describing the structures and processes of biological phenomena, and phenomena of the physical world. The most obvious application of information is to the engineering domains of computers and communications. This essay will provide an overview of the field; only passing knowledge of computer science and probability theory is required of the reader.
We start by describing objects with binary strings. We take the length of the shortest string description of an object as a measure of that object's complexity. Let's look at the most traditional application of information theory, communications. The sender and receiver each know specification method L. Message x can be transmitted as y, such that L(y)=x. This is written "y:L(y)=x" and the ":" is read as "such that". The cost of transmission is the length of y, |y|. The least cost is min{|y| : L(y)=x}, the length of the shortest description that L decodes to x.
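To make this concrete, here is a minimal sketch in Python of a hypothetical specification method L that sender and receiver might share: a description y has the form "<count>*<block>", and L(y) expands it. The format is purely illustrative; any agreed-upon method plays the same role.

    def L(y: str) -> str:
        """Toy specification method: y = "<count>*<block>" decodes to <block> repeated <count> times."""
        count, block = y.split("*", 1)
        return block * int(count)

    x = "ab" * 1000           # the message to transmit: 2000 characters
    y = "1000*ab"             # a 7-character description y with L(y) = x
    assert L(y) == x
    print(len(y), "<<", len(x))   # the cost of transmission drops from 2000 to 7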
A universal description method should have the following properties:
-- it should be independent of L, within a constant offset, so that we can compare the complexity of any object with any other object's complexity.
-- the description should in principle be performable by either machines or humans.
Such a method would give us a measure of absolute information content, the amount of data which needs to be transmitted in the absence of any other a priori knowledge.
The description method that meets these criteria is the Kolmogorov complexity: the size, in bits, of the shortest program that, without additional data, computes the string and terminates. Formally, with U a universal Turing machine, K(x) = min{|p| : U(p)=x}.
The conditional Kolmogorov complexity of string x given string y is K(x|y) = min{|p| : U(p,y)=x}, the length of the shortest program that computes x when given y as auxiliary input.
The length of the shortest program that can compute both x and y, and a way to tell them apart, is the joint complexity K(x,y) = min{|p| : U(p)=<x,y>}, where <x,y> is a standard encoding of the pair.
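K and its conditional and joint variants are not computable (as discussed below), but any real compressor gives computable upper-bound stand-ins. The sketch below uses zlib as that stand-in; C(x) here is just a compressed length, a crude proxy for K(x), and C(y+x)-C(y) is a common heuristic proxy for K(x|y), not the true quantity.

    import zlib

    def C(s: bytes) -> int:
        """Compressed length in bytes: a computable upper-bound stand-in for K(s)."""
        return len(zlib.compress(s, 9))

    x = b"the quick brown fox jumps over the lazy dog. " * 50
    y = x[:-1] + b"!"                                  # y is x with its last byte changed

    print("C(x)       ~ K(x):  ", C(x))
    print("C(xy)      ~ K(x,y):", C(x + y))            # describing both together costs little extra
    print("C(yx)-C(y) ~ K(x|y):", C(y + x) - C(y))     # x is nearly free once y is known

Two limiting cases illustrate the range K can take: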
1. Pi is an infinite sequence of seemingly random digits, but it contains only a few bits of information: the size of a short program that can produce the consecutive bits of pi forever. Informally we say the descriptional complexity of pi is a constant. Formally we say K(pi) = O(1), which means "K(pi) does not grow".
2. A truly random string is not significantly compressible; its description length is within a constant offset of its length. Formally we say K(x) = Theta(|x|), which means "K(x) grows as fast as the length of x".
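Both cases are easy to observe with an off-the-shelf compressor, again treating compressed length as a rough stand-in for the shortest description:

    import os
    import zlib

    structured = b"01" * 50000           # highly regular: a short program could print it
    random_str = os.urandom(100000)      # random bytes: incompressible with near certainty

    for name, s in [("structured", structured), ("random", random_str)]:
        print(name, len(s), "->", len(zlib.compress(s, 9)))
    # the regular string shrinks to a few hundred bytes; the random one does not shrink at all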
The combinations that can arise in a long series of coin flips can be divided into regular sequences, which are highly improbable, and irregular sequences, which are vastly more numerous. Wherever we see symmetry or regularity, we seek a cause. Compressibility implies causation. We can rid ourselves of the problematic nature of traditional inductive probability by redefining probability in terms of computational theory, via Kolmogorov complexity: the chance of generating string x in the form of a short program p:U(p)=x is 2^-|p|, since each bit of the program is an independent coin flip.
If the algorithm does not generate the exact string, we can include error (called "distortion" in information theory) as part of the description of the data: the description then consists of a program p together with an encoding of the difference between U(p) and x, and its cost is the sum of the two lengths.
The probability that a random program outputs x is called the "universal discrete semimeasure", and is given by m(x) = sum over p:U(p)=x of 2^-|p|, the total weight of all programs that output x.
The algorithmic probability of x is given by 2^-K(x), the weight of the single shortest program for x.
Coding theorem: -log m(x) = K(x) + O(1). The shortest program dominates the sum, so the universal semimeasure and the algorithmic probability agree to within a constant factor.
Conditional coding theorem: -log m(x|y) = K(x|y) + O(1).
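The following toy sketch illustrates the weighting behind m(x). It replaces the universal machine U with a deliberately simple, non-universal description method (a program of the form 1^k 0 w prints w repeated k+1 times), so the resulting sum is only an illustration of the idea, not the true semimeasure: strings with many short programs accumulate far more weight than irregular strings of the same length.

    from collections import defaultdict
    from itertools import product

    def run(p: str) -> str:
        """Toy, non-universal description method: p = 1^k 0 w prints w repeated k+1 times."""
        if "0" not in p:
            return ""                      # malformed: the repeat count never terminates
        k = p.index("0")                   # number of leading 1s = extra repetitions
        w = p[k + 1:]
        return w * (k + 1)

    m = defaultdict(float)                 # toy analogue of the universal semimeasure
    for n in range(1, 13):                 # every program of up to 12 bits
        for bits in product("01", repeat=n):
            p = "".join(bits)
            x = run(p)
            if x:
                m[x] += 2.0 ** -len(p)     # each program p contributes weight 2^-|p|

    # A regular string has many short programs; an irregular string of the same length has
    # only its literal description, so it collects a much smaller share of the weight.
    print(round(m["11111111"], 5), round(m["10110100"], 5))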
Rudolf Carnap suggested that we predict by taking a weighted sum of all explanations in a given description language. Ray Solomonoff generalized this to the universal description language of Turing machines, to come up with the following universal prediction procedure: take a weighted combination of all programs that explain the data and see what they print next.
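Here is a toy version of that procedure, with the universal language replaced by a tiny hypothesis class ("repeat the pattern w forever") and 2^-|w| standing in for the weight 2^-K of each hypothesis; it keeps the hypotheses consistent with the data and takes their weighted vote on the next bit.

    from itertools import product

    def predict_next(observed: str, max_len: int = 8) -> dict:
        """Toy Solomonoff-style prediction: hypotheses are 'repeat pattern w forever',
        each weighted by 2^-|w| as a crude stand-in for 2^-K(hypothesis)."""
        votes = {"0": 0.0, "1": 0.0}
        for n in range(1, max_len + 1):
            for bits in product("01", repeat=n):
                w = "".join(bits)
                stream = w * (len(observed) // n + 2)     # enough repetitions to cover the data
                if stream.startswith(observed):           # keep hypotheses that explain the data
                    votes[stream[len(observed)]] += 2.0 ** -n
        total = sum(votes.values())
        return votes if total == 0 else {b: v / total for b, v in votes.items()}

    print(predict_next("0101010101"))   # all consistent hypotheses predict '0' next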
"It is vain to do with more what can be done with fewer". Kolmogorov complexity provides an objective measure for simplicity. Computable approximations for specific combinatorial data structures have been given starting with Rissannen and Wallace. The minimum description length principle says to minimize the weighted sum of the model's structural complexity, and the error between the model and the data.
This distance measure accounts for any kind of similarity between objects. It also measures the length of the shortest program that transforms x into y and y into x. The minimal amount of irreversibility required to transform string x into string y is given by KR(x,y) = K(y|x) + K(x|y), the sum of the two conditional complexities.
Reversible computation can be used to minimize the amount of heat given off by a computing device. KR gives us the minimal amount of work required to create and destroy bits during a computation in which the intermediary program is not retained.
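In practice these conditional complexities can only be approximated. A minimal sketch, again using zlib compressed lengths, with C(x+y)-C(x) as a rough heuristic for K(y|x); the resulting number is a proxy for KR(x,y), not the true quantity.

    import os
    import zlib

    def C(s: bytes) -> int:
        return len(zlib.compress(s, 9))     # compressed length: a crude proxy for K

    def kr_estimate(x: bytes, y: bytes) -> int:
        """Rough proxy for KR(x,y) = K(y|x) + K(x|y) via compressed lengths."""
        return (C(x + y) - C(x)) + (C(y + x) - C(y))

    a = b"the cat sat on the mat. " * 40
    almost_a = b"the cat sat on the hat. " * 40   # nearly the same as a
    unrelated = os.urandom(len(a))                # unrelated, incompressible data

    print(kr_estimate(a, almost_a))   # small: little must change in either direction
    print(kr_estimate(a, unrelated))  # large: almost everything must be supplied anew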
On the other hand, an object can be considered superficial when it is not very difficult to recreate another object to perform its function. For example, a garbage pit can be created by a wide variety of random sequences of truckfuls of garbage; it doesn't matter much in which order the trucks come.
More examples of sophistication are provided by the highly evolved structures of living things, such as wings, eyes, brains, and so on. These could not have been thrown together by chance; they must be the result of an adaptive algorithm such as Darwin's algorithm of variation and selection. If we lost the genetic code for vertebrate eyes in a mass extinction, it would take nature a vast number of animal lifetimes to re-evolve them. A sophisticated structure has a high replacement cost.
Bennett calls the computational replacement cost of an object its logical depth. Loosely speaking, depth is the necessary number of steps in the causal path linking an object with its plausible origin. Formally, it is the time required by the universal Turing machine to compute an object from its compressed original description.
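A toy illustration of the idea, not Bennett's formal definition: regenerating an object from a short generative description takes noticeable time, while "computing" it from a literal copy of itself is immediate, even though the literal description is far longer.

    import time

    def timed(f):
        t0 = time.perf_counter()
        result = f()
        return result, time.perf_counter() - t0

    def from_short_program(count=20000):
        """A few lines of code (a short description) that take real time to run:
        the first `count` primes by trial division."""
        primes, n = [], 1
        while len(primes) < count:
            n += 1
            is_prime = True
            for p in primes:
                if p * p > n:
                    break
                if n % p == 0:
                    is_prime = False
                    break
            if is_prime:
                primes.append(n)
        return primes

    deep, t_deep = timed(from_short_program)     # short description, long computation
    copy, t_copy = timed(lambda: list(deep))     # long literal description, instant computation
    print(f"from short program: {t_deep:.3f}s   from literal copy: {t_copy:.6f}s")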
In some cases it makes sense to define value in terms of replacement cost. If we see an organism or tradition of great replacement cost we may on those grounds alone deem it valuable; in such cases logical depth gives us an objective measure of that value.
K(x) is in general uncomputable, since we cannot be sure a program will halt when we are testing it for correctness in generating the string. The good news is that we can derive computable "entropies", or descriptional complexities, for computable computation structures such as polynomials, decision trees, and finite automata. These can be used as objective functions for adaptive or "learning" algorithms which construct such computing structures, for example as "fitness functions" in genetic programming. In general, such an entropy has the two-part form of the MDL principle above: the cost of specifying the structure and its parameters, plus the cost of describing the data's deviations from it.
For example, when fitting a degree-k polynomial with precision d, the structural cost is roughly the (k+1) coefficients specified to d bits each, plus some "scaling constant" s; the error cost is the number of bits needed to describe how far the data fall from the fitted curve.
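A sketch of this kind of two-part scoring for polynomial fitting, under illustrative assumptions: the structural cost is taken as (k+1)*d bits for the coefficients, and the residuals are charged with a simple Gaussian code; the constants and the residual code are choices made for this example, not the original formula.

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    xs = np.linspace(-1.0, 1.0, 200)
    ys = 3 * xs**3 - xs + rng.normal(0.0, 0.05, xs.size)   # cubic signal plus noise

    def description_length(k: int, d: int = 16) -> float:
        """Two-part cost: ~(k+1)*d bits for the coefficients, plus bits for the residuals."""
        coeffs = np.polyfit(xs, ys, k)
        resid = ys - np.polyval(coeffs, xs)
        var = max(float(np.mean(resid**2)), 1e-12)
        # Crude Gaussian code for the residuals, used only to compare models with each other.
        error_bits = 0.5 * xs.size * math.log2(2 * math.pi * math.e * var)
        return (k + 1) * d + error_bits

    best = min(range(1, 10), key=description_length)
    print("degree chosen by minimum description length:", best)
    # typically 3: higher degrees fit slightly better but cost more bits to describe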
A binary decision tree that best describes a relational database can be found the same way, by minimizing the cost of encoding the tree itself plus the cost of encoding the data's exceptions to the tree's predictions.
For game strategies, the number of states in the finite state machine that describes the strategy is used as a measure of "bounded rationality" in game theory. A more accurate measure would be the FSM's time and space costs, plus its learnability cost, but the latter often dominates and is not well characterized by machine learning theory; the number of states is a crude approximation of the difficulty an agent has in learning the strategy.
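As an illustration of the state-count measure, here is tit-for-tat for the repeated prisoner's dilemma written as a finite state machine; the encoding (a table of transitions keyed by the opponent's last move) is one convenient choice among many.

    # A repeated-game strategy as a finite state machine: each state names a move,
    # and transitions are driven by the opponent's last move. Tit-for-tat needs only
    # two states, so by the state-count measure it is a very simple strategy.
    TIT_FOR_TAT = {
        "start": "cooperate",
        "moves": {"cooperate": "C", "defect": "D"},
        "transitions": {
            ("cooperate", "C"): "cooperate",
            ("cooperate", "D"): "defect",
            ("defect", "C"): "cooperate",
            ("defect", "D"): "defect",
        },
    }

    def play(fsm: dict, opponent_moves: str) -> str:
        state, out = fsm["start"], []
        for move in opponent_moves:
            out.append(fsm["moves"][state])
            state = fsm["transitions"][(state, move)]
        return "".join(out)

    def state_count(fsm: dict) -> int:
        return len({s for s, _ in fsm["transitions"]})

    print(play(TIT_FOR_TAT, "CCDDC"))            # -> CCCDD
    print("states:", state_count(TIT_FOR_TAT))   # -> 2, the crude complexity measure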
Gregory Chaitin, Algorithmic Information Theory
Ming Li & Paul Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications