# Testing Scientific Hypotheses Using Specified Complexity

Editor’s note: We have been reviewing and explaining a new article in the journal BIO-Complexity, “A Unified Model of Complex Specified Information,” by George D. Montañez.

How can one decide among competing explanations? In everyday life, we use an assortment of techniques to weigh and reject explanations for the phenomena we observe. For example, when walking into the kitchen and observing a trail of crumbs and an opened cookie jar, we can often infer that a child has enjoyed a before-dinner treat. When confronted, the child’s explanation that cookie monsters ate the missing treat is rejected without much reflection, given our knowledge of the world and its regularities.

Science requires us to be more precise when rejecting explanations. We aspire to quantify how plausible or implausible explanations are before rejecting them. For probabilistic explanations (namely those that propose some probabilistic process as an explanation for the event observed), we can precisely characterize how likely or unlikely an observation is under the proposed process, and we can reject the processes put forward as explanations (called hypotheses or models) which do not make the observation likely. Doing so rigorously is what we call statistical hypothesis testing.

## Rejecting Explanations and Low-Probability Events

When we observe an event we often seek to understand by what process it occurred. Given enough observations, we can formulate a model of the process, which for randomized events will typically be a probabilistic model. This model, created to explain what we observe, should render the observations reasonably likely (this, of course, is why the model was proposed in the first place). When a model proposed to explain our observations does not render the type of events we observe likely, we tend to reject that model as a plausible explanation of the events observed.

Consider a state lottery where someone related to a lottery official wins six jackpots in a single year. Under a “fair lottery” model, a past lottery winner is extremely unlikely to win another jackpot in the same year, let alone five more. Such an unexpected event will send us (and state regulators) looking for alternative explanations for this rare event, rejecting the “lucky fair lottery” model as a plausible explanation.

However, low probability alone cannot be used to reject models. Take, for example, a deck of 52 distinguishable playing cards, randomly shuffled and placed on a table in a row. Any particular random sequence of cards (called a permutation) has a probability of roughly $10^{-80}$ of occurring, which is about the same probability of taking a random electron from the known universe (somewhere from a gas cloud thousands of light years away, for example) and having someone else randomly choose the same exact electron out of all the possible electrons they could have chosen in the universe. If any probability can be considered small, surely this probability qualifies. This means the specific sequence of cards observed was extremely unlikely under the random shuffle model, even though that was the true explanation for the event.

This seeming paradox disappears when we ask a better question: “What is the probability of observing this event or any more extreme one?” Once we ask that question, we see that the probability of observing a sequence with a probability of $10^{-80}$ or smaller under the random shuffle model is actually 1! It is guaranteed to occur, since every permutation has the same small probability of occurring. Counterintuitively, enough small-probability events can add up to one large-probability event. So our observed event wasn’t actually surprising at all once we describe it as “a low-probability sequence is observed.” This tells us that we should consider classes of surprising events when deciding whether or not to reject models as explanations.
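The point can be checked by brute force on a deck small enough to enumerate. The sketch below is only an illustration: it assumes a 5-card deck as a stand-in for the 52-card deck, and the helper name `tail_probability` is hypothetical. It sums the probability of every ordering that is at most as probable as the one observed.

```python
from fractions import Fraction
from itertools import permutations


def tail_probability(model, observed):
    """Total probability of all outcomes no more probable than `observed`."""
    p_obs = model[observed]
    return sum(p for p in model.values() if p <= p_obs)


# Assumption: a 5-card deck, so all 5! = 120 orderings can be listed.
orderings = list(permutations(range(5)))
uniform_shuffle = {o: Fraction(1, len(orderings)) for o in orderings}

observed = orderings[37]  # an arbitrary ordering, chosen for illustration
p_obs = uniform_shuffle[observed]
tail = tail_probability(uniform_shuffle, observed)

print("probability of the observed ordering:", p_obs)
print("probability of an ordering at least this improbable:", tail)
```

Each individual ordering is improbable, yet the class "orderings at least this improbable" covers the whole space, so observing a member of that class is no evidence against the shuffle model.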

## Grouping Surprising Events: P-Values

Scientists often make use of p-values (sometimes viewed as tail probabilities) in determining when models are poor explanations for observed events. The p-value is the probability of observing an event at least as extreme as the one actually observed, under the proposed model. When an event has a very small p-value, this can be used as evidence against the proposed explanation.

P-values avoid the low-probability “paradox” we encountered earlier by considering not just the observed event but also any that is at least as extreme as the one observed. By doing this, when all events are low-probability events under a model (such as for a uniform probability distribution on a very large space), we will not reject the proposed model simply because we observe a low-probability event occurring: the p-value in that case will be large. However, if not all events are equally low-probability, yet we observe one (leading to an extremely small p-value under that definition of extreme), we would have evidence against the model in question.
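As a rough illustration of how a p-value separates two equally improbable observations, consider 20 flips of a fair coin. The choice of test statistic (the number of heads, with "more heads" counted as more extreme) and the function name `binomial_upper_tail` are assumptions made for this sketch, not something taken from the paper.

```python
from fractions import Fraction
from math import comb


def binomial_upper_tail(n, k, p=Fraction(1, 2)):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n = 20
# Every exact sequence of 20 flips has the same probability under a fair coin.
p_single_sequence = Fraction(1, 2) ** n

# One-sided p-values using "number of heads" as the test statistic.
p_value_all_heads = binomial_upper_tail(n, 20)  # observed 20 heads
p_value_ten_heads = binomial_upper_tail(n, 10)  # observed 10 heads

print("probability of any one sequence:", float(p_single_sequence))
print("p-value for 20 heads:", float(p_value_all_heads))
print("p-value for 10 heads:", float(p_value_ten_heads))
```

Both observations are single sequences with identical probability, but only the all-heads outcome has a small tail probability, so only it counts as evidence against the fair-coin model.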

We see that mere improbability of a particular outcome isn’t enough to justify rejecting a model. However, small tail probability is. P-values use “extremeness” of test statistics as a threshold to compute tail probabilities.

There are other ways to compute tail probabilities, using different functions to measure extremity. Specified complexity provides an alternative way of computing tail probabilities which can be used to reject proposed explanations of events.

## Specified Complexity and Hypothesis Testing

A recently published paper in BIO-Complexity (Montañez 2018) by machine learning researcher and computer science professor George D. Montañez makes the connection between specified complexity and statistical hypothesis testing explicit. This connection was independently discovered by Aleksandar Milosavljević (Milosavljević 1993) and William Dembski (Dembski 2005), who both showed (to varying degrees) that specified complexity models can be viewed as hypothesis test statistics, similar to p-values. Montañez fleshes out this relation, demonstrating that every p-value hypothesis test has an equivalent specified complexity hypothesis test, and every canonical specified complexity model can be used to bound tail probabilities in a way similar to p-values. Remarkably, specified complexity hypothesis tests can also be used in some situations where p-values cannot be computed, such as when the likelihood of a particular observation is known but nothing else about the distribution outside of that value (whereas analytic p-values will typically require knowing something about the form or shape of the distribution).

This property of specified complexity models is not just of academic interest. The paper gives a table with computed cutoff values for specified complexity hypothesis tests, allowing applied researchers to make use of the computed values directly as they do with other statistical hypothesis test tables. This could allow specified complexity models to be used in fields other than intelligent design (as they have been with Milosavljević’s algorithmic significance method (Milosavljević 1993, 1995)), and to be used by those without advanced mathematical training.

## Bounding Tail Probability Using Specification Functions

As we saw, low probability is not enough to rule out explanations, but the combination of low probability and high specification is. The paper explains:

> The addition of specification is what allows us to control these probabilities. Considering probabilities in isolation is not enough. While unlikely events can happen often (given enough elements with low probability), specified unlikely events rarely occur. This explains why even though every sequence of one thousand coin flips is equally likely given a fair coin, the sequence of all [tails] is a surprising and unexpected outcome whereas an equally long random sequence of heads and tails is not. Specification provides the key to unlocking this riddle.

What is it about observing large specification values (in conjunction with low probability of the observation) that produces small tail probabilities as a result? Montañez continues:

> …[S]ince the specification values are normalized, few elements can have large values. Although many elements can have low probability values (thus making the occurrence of observing any such a low-probability event probable, given enough of them), few can have low probability while being highly specified.

Thus, we can think of specification functions as placing a form of “specification mass” over a space of possible outcomes, where this mass is conserved (as is real mass). We can concentrate the mass in one area of the space only by decreasing it in other areas. This implies that we cannot place large amounts of specification mass on many outcomes simultaneously. Thus, observing concentrated specification mass (i.e., a large specification value) on a single outcome makes it special and surprising: not every outcome can be like that. This not-every-outcome-can-be-like-that-ness is exactly what p-values and small tail probabilities are meant to capture. And because the outcome in question has low probability, we are assured that the hypothesized process does not favor it in a way that would make it occur often. This combination of low probability and high specification is what powers specified complexity hypothesis tests.
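The conservation argument can be checked numerically on a toy example. The sketch below assumes a finite space of 1,000 outcomes, an arbitrary probability model `p`, and an arbitrary nonnegative specification function `nu` whose total mass equals the scaling constant `r`; these values are all invented for illustration. It then computes exactly how much probability falls on outcomes whose kardis value r·p(x)/ν(x) is at most α. Because ν sums to at most r, that probability cannot exceed α.

```python
from fractions import Fraction
import random


def tail_mass(p, nu, r, alpha):
    """Exact Pr(r * p(X) / nu(X) <= alpha) when X is drawn according to p."""
    # Written as r * p(x) <= alpha * nu(x) to avoid dividing by nu(x).
    return sum(px for px, nx in zip(p, nu) if r * px <= alpha * nx)


rng = random.Random(0)  # fixed seed so the toy example is reproducible
n = 1000

# Hypothetical probability model: random positive weights, normalized to 1.
raw = [rng.randint(1, 100) for _ in range(n)]
total = sum(raw)
p = [Fraction(w, total) for w in raw]

# Hypothetical specification values, skewed so a few outcomes hold most mass.
nu = [Fraction(rng.randint(1, 10) ** 6) for _ in range(n)]
r = sum(nu)  # scaling constant: total specification mass

alphas = [Fraction(1, 2), Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)]
results = {a: tail_mass(p, nu, r, a) for a in alphas}

for a in alphas:
    print(f"alpha = {float(a):<6} tail probability = {float(results[a]):.6f}")
```

Changing the seed, the model, or the specification values changes the individual tail probabilities, but as long as the specification mass stays within `r` each one remains at or below its α.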

## Blindly Choosing Special Sequences

Having obtained a specification value for your recently discovered numeric sequence of prime numbers and learning that specified complexity models can be used as hypothesis test statistics, you decide that you’d like to rule out the hypothesis that the sequence was blindly chosen at random from among the space of all possible sequences of the first thirty-one positive integers. Of those, there are $31^{11}$ possible sequences of the same length as your observed sequence. Under a blind uniform probability each of these has the probability $31^{-11}$ of individually occurring. Therefore, your probability estimate $p(x)$ under the proposed model is $31^{-11}$. You combine this with your previous estimates of $r$ and $\nu(x)$, to obtain a kardis value of

$$\kappa(x) = r\,\frac{p(x)}{\nu(x)} = 600000 \cdot \frac{31^{-11}}{1} \approx 2.36 \times 10^{-11}.$$

Taking the negative log base-2 of this number, you obtain a specified complexity value of roughly 35 bits, which according to the table of test-statistic cutoff values given in the Montañez paper would allow you to reject the blind chance hypothesis at a significance level smaller than 0.0001 (actually, much smaller). Given that significance levels of 0.01 are often cited as grounds for rejection of a hypothesized model, you confidently reject the uniform chance hypothesis as an explanation for the sequence observed.
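The arithmetic in this example is easy to reproduce. The sketch below uses only the values stated above (r = 600000, ν(x) = 1, and sequences of length 11 over 31 symbols); the variable names are my own, and the comparison against 0.0001 simply mirrors the significance level quoted in the text rather than reproducing the paper's cutoff table.

```python
from math import log2

r = 600_000          # scaling constant from the worked example
nu_x = 1             # specification value of the observed sequence
alphabet_size = 31   # the first thirty-one positive integers
length = 11          # length of the observed sequence

# Probability of one specific sequence under the uniform (blind chance) model.
p_x = alphabet_size ** -length

kappa = r * p_x / nu_x      # kardis value
sc_bits = -log2(kappa)      # specified complexity in bits

print(f"p(x)     = {p_x:.3e}")
print(f"kappa(x) = {kappa:.3e}")
print(f"SC(x)    = {sc_bits:.1f} bits")
print("kappa below 0.0001:", kappa < 1e-4)
```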

But what about other random and semi-random processes? Just because you were able to reject a uniform chance model for the sequence does not necessarily mean that some other randomized process, one that is not uniform, couldn’t be responsible. What would such an explanation need to look like to avoid rejection under such hypothesis tests? You remember a section in the paper dealing with minimum plausibility baselines that may be relevant, but before continuing on, a warm nuzzle from Bertrand your dog reminds you that it is time to call it a night. A new day will bring new opportunities for investigation, with the added benefit of having a well-rested mind and body.

## Bibliography

Dembski, William A. 2005. “Specification: The Pattern That Signifies Intelligence.” Philosophia Christi 7 (2): 299–343. https://doi.org/10.5840/pc20057230.

Milosavljević, Aleksandar. 1993. “Discovering Sequence Similarity by the Algorithmic Significance Method.” Proc Int Conf Intell Syst Mol Biol 1: 284–91.

———. 1995. “Discovering Dependencies via Algorithmic Mutual Information: A Case Study in DNA Sequence Comparisons.” Machine Learning 21 (1-2): 35–50. https://doi.org/10.1007/BF00993378.

Montañez, George D. 2018. “A Unified Model of Complex Specified Information.” BIO-Complexity 2018 (4). http://bio-complexity.org/ojs/index.php/main/article/view/BIO-C.2018.4.
