Information theory

Information theory is a framework for understanding the transmission of data and the effects of complexity and noise on those transmissions.

Shannon information

Claude Shannon developed a model of information transmission in terms of information entropy, originally to describe the transfer of information through a noisy channel.

Digitized information consists of discrete positions, each taking one of a set of quantized values. (Computers typically use a binary system, with 0 and 1 as the allowed values. Genetic information can be thought of as digitized, with A, C, G, and T as the allowed values.) If each position has only one possible value, the signal can be said to have low information content, or in more colloquial terms, "no news." As more values become possible at each point in the signal, it becomes less predictable, and hence the information content of any particular "message," or instance of the signal, increases.

Shannon developed his theory to provide a rigorous model of the transmission of information. Importantly, information entropy provides an operational and mathematical way to describe the amount of information that is transmitted, as well as the amount of redundancy required to get a certain amount of information reliably through a band-limited noisy channel.
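To make this concrete, here is a minimal Python sketch (the sequences are illustrative) of the empirical Shannon entropy of a symbol stream, computed from the observed symbol frequencies:

    from collections import Counter
    from math import log2

    def shannon_entropy(sequence):
        # Empirical Shannon entropy in bits per symbol: H = sum(p * log2(1/p)).
        counts = Counter(sequence)
        total = len(sequence)
        return sum((n / total) * log2(total / n) for n in counts.values())

    # Four equally frequent symbols give 2 bits per symbol, the maximum for a
    # four-letter alphabet; a constant, fully predictable sequence is "no news."
    print(shannon_entropy("ACGTACGTACGTACGT"))  # 2.0
    print(shannon_entropy("AAAAAAAAAAAAAAAA"))  # 0.0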

In genetics, a point mutation increases the information entropy of a DNA base pair. However, natural selection counteracts this increase by eliminating organisms with harmful mutations and their consequent higher information entropy (or colloquially, lower information content).[1] While information theory does not describe how a sequence of DNA bases is expressed into features during development, it does show that the transmission of genetic information from one generation to the next can be described mathematically. Any feature of a string that preserves fitness will have a lower information entropy, or higher information content, than a random string. Richard Dawkins's weasel program, which investigates cumulative selection, shows a lowering of information entropy.[2]
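To see the weasel effect concretely, here is a minimal sketch of a weasel-style cumulative-selection loop in Python; the target phrase is Dawkins's, but the mutation rate and brood size are illustrative choices, not his exact parameters:

    import random

    TARGET = "METHINKS IT IS LIKE A WEASEL"
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

    def mutate(parent, rate=0.05):
        # Copy the parent, miscopying each character with probability `rate`.
        return "".join(random.choice(ALPHABET) if random.random() < rate else c
                       for c in parent)

    def weasel(brood_size=100):
        parent = "".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
        generations = 0
        while parent != TARGET:
            generations += 1
            # Cumulative selection: keep only the offspring closest to the target.
            parent = max((mutate(parent) for _ in range(brood_size)),
                         key=lambda s: sum(a == b for a, b in zip(s, TARGET)))
        return generations

    print(weasel())  # typically converges within roughly a hundred generations

Because selection keeps the best of each brood, matches to the target accumulate over the generations; the per-position uncertainty, and with it the information entropy, falls over time.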

While there are similarities between the mathematical forms used to describe thermodynamic entropy and information entropy, the former refers exclusively to the distribution of energy. Thermodynamic entropy increases in a closed system according to the Second Law, but it is unclear that thermodynamic entropy and information have anything in common beyond their mathematical notation. Furthermore, it is unclear what a "closed system" would even encompass with respect to genetic information.[citation needed] At the least, natural selection influences the propagation of genetic coding.

Creationists really don't like this stuff, but won't say why. It's likely because they don't want an actual definition of information that can be argued against.[3]

Kolmogorov complexity

Kolmogorov complexity (also known as Chaitin information, or algorithmic information) deals with the use of algorithms to compress and decompress information.[4] Computer scientists developed it to discuss how data can be compressed as efficiently as possible so as to take up less disk space.

The Kolmogorov complexity depends on the number of instructions an algorithm would need to reproduce the information. Thus "A20" can be thought of as a compression of "AAAAAAAAAAAAAAAAAAAA," and "(AB)9" can be thought of as a compression of "ABABABABABABABABAB." Any instruction, including insertion, repetition, or deletion, can change the Kolmogorov complexity. The Kolmogorov complexity can thus be thought of as the maximum amount of information "in the string" or "in the sequence."
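True Kolmogorov complexity is uncomputable in general, but a toy compressor conveys the idea. Here is an illustrative Python sketch (not any standard library's API) that produces run-length descriptions such as "A20":

    from itertools import groupby

    def run_length(s):
        # Describe a string as runs of repeated characters: "AAAA..." -> "A20".
        return "".join(ch + str(len(list(group))) for ch, group in groupby(s))

    print(run_length("A" * 20))  # "A20": three symbols describe twenty
    print(run_length("AB" * 9))  # "A1B1A1B1...": 36 symbols describe eighteen

Note that this particular description language compresses "AAAAAAAAAAAAAAAAAAAA" but inflates "ABABABABABABABABAB"; a language with a repeat operator would capture "(AB)9" instead. Which strings count as simple depends on the algorithm, as the next paragraph stresses.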

The Kolmogorov complexity depends entirely on the algorithm used. Hence, while there are uses in genetics, determining the change in Kolmogorov complexity would require a description of all the processes used to reproduce the developmental information from the DNA sequence; one cannot tell the amount of information (or Kolmogorov complexity) just by looking at a string of letters, symbols, or DNA. (This is part of the reason why the amount of information in the words "car" and "vehicle" cannot be compared: it depends on the algorithms of linguistic interpretation, and the number of letters is insignificant.) Notably, because the processes "change" (or "mutate") and "delete" can be thought of as additional algorithmic steps, they can increase the Kolmogorov complexity (or information content). More significantly, they can potentially change the content.[5]

Note that any comparison of information "in the string" made by creationists (in the guise of meaning) is Kolmogorov complexity, while the "increase of noise" or "information loss through loss of DNA sequence fidelity" caused by mutations usually refers to Shannon's information entropy. The two cannot be used interchangeably.[6]

Word analogies

Word analogies are tricky when applied to concepts from information theory.

Any change to a string of text is, by virtue of being a change, an increase in information entropy (or a loss of information content). This is true whether the string is a word ("rational" changed to "rasional") or nonsense ("alkfd" to "alkfg"). (However, a proofreader, acting as an agent of natural selection, could reject erroneous copies of a text and thereby keep its information entropy from rising.)
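A minimal Python sketch of that proofreader (the alphabet and error rate are illustrative assumptions):

    import random

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def noisy_copy(text, error_rate=0.1):
        # Copy through a noisy channel: each letter may be miscopied.
        return "".join(random.choice(ALPHABET) if random.random() < error_rate
                       else c for c in text)

    def proofread_copy(text, error_rate=0.1):
        # Reject erroneous copies until a faithful one comes through.
        while True:
            copy = noisy_copy(text, error_rate)
            if copy == text:
                return copy

    print(noisy_copy("rational"))      # may come back as, e.g., "rasional"
    print(proofread_copy("rational"))  # always "rational"

Without the proofreader, copying errors accumulate; with it, erroneous copies never propagate.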

In terms of Kolmogorov complexity, changes in letters can supply more or less information, but this depends on the linguistic structure. The number of processes required to interpret a word through an algorithm may or may not depend on the number and identity of its letters, and hence "more" or "less" has little meaning. In the same way, mutations in genetics can potentially change how an organism develops, but without a complete understanding of the processes of development, a mutation is not "more" or "less" information.

An attempt to interpret a word analogy through both concepts at the same time fails because the two concepts are neither independent nor the same. It can be true that a change of a letter ("lost" to "post") reduces copying fidelity (increased information entropy) and yet changes some linguistic meaning (a different Kolmogorov complexity).

Information theory and genetics, evolution, and development

The relationship between biology and information theory given above, together with other approaches in the literature, suggests that the terms "biological information," "developmental information," and "genetic information" are ambiguous without clarification. Even then, ambiguity remains:

In biology the term information is used with two very different meanings. The first is in reference to the fact that the sequence of bases in DNA codes for the sequence of amino acids in proteins. In this restricted sense, DNA contains information, namely about the primary structure of proteins. The second use of the term information is an extrapolation: it signifies the belief or expectation that the genome somehow also codes for the higher or more complex properties of living things. It is clear that the second type of information, if it exists, must be very different from the simple one-to-one cryptography of the genetic code. This extrapolation is based, loosely, on information theory. But to apply information theory in a proper and useful way it is necessary to identify the manner in which information is to be measured (the units in which it is to be expressed in both sender and receiver, and the total amount of information in the system and in a message), and it is necessary to identify the sender, the receiver and the information channel (or means by which information is transmitted). As it is, there exists no generally accepted method for measuring the amount of information in a biological system, nor even agreement of what the units of information are (atoms, molecules, cells?) and how to encode information about their number, their diversity, and their arrangement in space and time.[7]

Creationist information theory

Creationists, in an attempt to coat their myths with a veneer of science, have co-opted information theory as a plausible-sounding attack on evolution. Essentially, the claim is that the genetic code is like a language and thus transmits information, and, in part due to the usual willful misunderstanding of the second law of thermodynamics (which is about energy, not information), they maintain that information can never increase.[8] Therefore, changes they cannot outright deny are defined as "losing information," while changes they disagree with are defined as "gaining information," which by their definition is impossible. Note that at no point do creationists specify what information actually is; they often purposely leave the concept undefined. They tend to shift its meaning on an ad hoc basis depending on the argument, relying on colloquial, imprecise definitions of information rather than quantifiable ones -- or worse, switching between different definitions depending on the context of the discussion or argument.

Dr. Werner Gitt and In the Beginning was Information

Understanding that information theory has a relationship to genetics and evolution, creationists have used the language of information theory in an attempt to discredit evolution. Dr. Werner Gitt published a monograph In the Beginning was Information[9] that creationists invariably refer to when arguing about information theory and evolution. Gitt's book is problematic in its structure and in its assertions about information theory.

Gitt separates the scientific version of information from other types. He singles out Shannon information as "statistical" and then partitions information into syntax, semantic (or "meaningful") information, pragmatic information, and apobetics. In doing so, he makes a number of claims about how genetics works. The text develops a number of statements that Gitt numbers as "theorems," as if the text were a mathematics textbook, and he describes them as a "series of theorems which should also be regarded as laws of nature, although they are not of a physical or a chemical nature."

This form of argument is problematic on multiple accounts. First, theorems are mathematical statements derived from postulates and definitions and proved through deductive logic. Gitt does not state his assumptions and leaves many terms undefined. More problematically, his theorems are not mathematical statements at all; they are assertions. (His binning of Shannon information as "statistical" and the "lowest level" of information indicates Gitt's disdain for mathematics.) Second, theorems are the result of deductive logic, while scientific laws are the result of inductive logic based on observation; the two cannot be equated. Gitt does not refer to any observation in developing his theorems, and hence, by definition, they are not laws.[10] It is unclear how one can make statements about the natural world without any observations to support them. Third, as described below, his model is untestable and hence cannot be deemed valid or invalid.

In essence, Gitt uses the language of mathematics and science but does not perform a mathematical proof or employ the scientific method. Instead, he makes a number of assertions that cannot be validated; his text is a poorly constructed rhetorical argument, not a scientific one.

Semantic or meaningful information

At the heart of Gitt's text is the concept of meaningful information. Gitt does not define semantic information; instead he relies on references to hieroglyphics, language, and computer programs. In his theorems he then makes unjustified generalizations from linguistics to genetics. Essentially, Gitt conflates the informal notion of information (such as the knowledge in a book) with that of information theory, producing assertions that are meaningless for genetics. Two of his statements provide examples:

  • "There can be no information without a sender." It is certainly true in the case of books and writing that a human entity must have written or typed the original source. A reasonably educated person has observed other people writing, and has written him or herself. However, applying that generalization to genetics is problematic. An intelligent source has never been observed to create a genetic code naturally, nor is there any inferential evidence that this occurs. (The only exception is, of course, scientists in the laboratory who have only recently done so.) To assume that there must be a sender or an intelligent source of information cannot be validated.
  • "It is impossible for information to exist without having been established voluntarily by a free will." Again, this makes sense in the case of writing, books, and computer programs because we observe others generating this type of information (or have ourselves). There is no evidence that during procreation a supernatural being is deciding which genes to pass on, or was the original source of a genetic code.

Books, language, and computer programs do at times provide useful analogies to genetic information, but the analogy breaks down when comparing origins, and not every statement about information in books or computer programs can be generalized to genetic information.

Statements on evolution

Gitt concludes the following about evolution:

We find proposals for the way the genetic code could have originated, in very many publications [e. g. O2, E2, K1]. But up to the present time nobody has been able to propose anything better than purely imaginary models. It has not yet been shown empirically how information can arise in matter, and, according to Theorem 11, this will never happen.

"Theorem" 11 (deduced without postulates or definitions) states that

A code system is always the result of a mental process (see footnote 14) (it requires an intelligent origin or inventor).

Gitt essentially uses an argument from ignorance and then invokes Theorem 11, an invalid deductive statement (per the last section) based entirely on his model rather than on evidence, to rule out evolution entirely. (Ironically, Gitt's model is itself "purely imaginary.") His statement on mutations is similar:

This idea is central in representations of evolution, but mutations can only cause changes in existing information. There can be no increase in information, and in general the results are injurious. New blueprints for new functions or new organs cannot arise; mutations cannot be the source of new (creative) information.

Unfortunately, without any measurement to back it up, his assertion that information cannot increase cannot be validated. Gitt never defines meaningful information,[11] provides no way to measure it, and gives no qualitative sense of more or less. Hence, his proposition is untestable and unfalsifiable. Gitt has constructed his model such that the status quo is meaningful, and anything that manipulates information and is not intelligent (or God) makes the information less meaningful.

In summary, Gitt has conflated deductive and inductive logic to generate an invalid model, built on tenuous assertions and a false comparison between DNA sequences and human-produced texts and algorithms. The model is not based on observations of the natural world, despite making extraordinary claims about it. It makes statements about more and less information, yet the information cannot be quantified. The model is untestable and unfalsifiable. Overall, Gitt's model is worthless for describing information in the natural world.

See also

  • Gene expression describes how a DNA sequence, "genetic information", is expressed in an organism.

Footnotes

  1. http://nar.oxfordjournals.org/cgi/reprint/28/14/2794
  2. See the Wikipedia article on weasel program.
  3. Creationists always seem to assume that science-y folks are using Shannon information, and then say it's wrong, such as in these exchanges between PZ Myers and Michael Egnor. Of course, PZ Myers wasn't talking about Shannon information in the first place.
  4. A fairly simple explanation of this form of information by Chaitin himself is here.
  5. See this blog post about this point.
  6. For an explanation of how creationists confuse different types of information, see this talkorigins letter.
  7. Nijhout, H. F., BioEssays, September 1990, vol. 12, no. 9, p. 443.
  8. Notably, this is a change from the tactic of claiming that there can be "no beneficial mutations." Because information theory is more difficult for the layman to understand, it is easy to hide behind it without really understanding it. It is also intimately related to the "evolution couldn't possibly have made eyes, wings, or flagella" arguments.
  9. http://clv.dyndns.info/pdf/255255.pdf
  10. Even by Gitt's own commentary on laws in In the Beginning was Information, laws require observation, as he states, "The Laws of nature are based on experience."
  11. When asked what information is, Gitt writes "That [it] is not possible [to define information] because information is by nature a very complex entity. The five-level model [that Gitt developed] indicates that a simple formulation for information will probably never be found." Basically, Gitt is unwilling or too lazy to formulate a formal definition of information or meaningful information because it is too much work!