The State of the Art in Evolutionary Computation

Editor’s note: Dr. Yampolskiy is Associate Professor in the department of Computer Engineering and Computer Science at the Speed School of Engineering, University of Louisville. In this series, he asks: “What Can and Can’t Darwin’s Algorithm Compute?” See previous posts in the series here:

In order to establish the state of the art in evolutionary computation I examined a number of survey papers^{43, 44} and seminal results^45-50 looking at produced human-competitive results, as they are meant to represent the greatest accomplishments of the field. While, on the surface the results may seem impressive, deeper analysis shows a complete absence of success in evolving non-trivial software from scratch and without human assistance. It is of course necessary to be precise about what it is we are trying to measure or detect, so as to avoid disagreements resulting from ambiguity in terms being used.

It may be difficult to formally specify what makes a piece of software non-trivial, but an intuitively attractive measure of the length of the program, expressed as the number of lines of code, is not a sufficient indicator of complexity, as it could have an extremely low Kolmogorov complexity.⁵¹ Inspired by the Turing Test^{52, 53}, which is based on an inability to distinguish output from a person and a computer, I propose defining non-trivial software as that which would take an average experienced human programmer at least a full hour of effort to produce if given the same problem to solve. If the solution source code could be produced with significantly less effort (for example, one minute), it may not be sufficiently complex and the problem may be deemed trivial for our purposes. This approach to specifying triviality would exclude “Hello World” and most other toy programs/problems from consideration, which is exactly what we wanted to achieve as the main benefit from being able to evolve software would come from ability to replace full time programmers.

Two Further Conditions

With regard to the other two conditions, they are much easier to specify. From “scratch” means that we are not starting with an existing version of a program (but are happy to rely on existing APIs, subject to the non-triviality of all newly produced code). Without human assistance can be interpreted to mean that the programmer is working alone, or a team of programmers is working an equivalent amount of time. For example, two programmers would each need at least forty minutes to solve the problem, which implies a small communication overhead.

Reading early claims about capabilities of EA (Evolutionary Algorithms) feels just like reading early predictions from AI literature⁵⁴. Some early success is projected into the future by assuming that the same rate of progress with continue and it is claimed that complete success is just years away. However, as with early AI, the claims are inflated, unsupported, overly optimistic, phrased in deceptive and metaphoric language, and the solutions do not scale to real world problems. Perhaps an EA “winter” is long overdue. Here is how Koza, presents the state of the field in 1994:

[I]n this article, we will present a number of examples from various fields supporting the surprising and counter-intuitive notion that computers can indeed by programmed by means of natural selection. We will show, via examples, that the recently developed genetic programming paradigm provides a way to search the space of all possible programs to find a function which solves, or approximately solves, a problem.¹⁵

Sixteen years later he reports results of what he calls an “extraordinarily long” experiment:

An additional order-of-magnitude increase was achieved by making extraordinarily long runs on the largest machine (the 1,000-node machine). … The length of the run that produced the two patentable inventions was 28.8 days — almost an order-of-magnitude increase (9.3 times) over the overall 3.4-day average for typical runs of genetic programming that our group had been making at the time.⁴³

One quickly realizes that most improvements in the field simply come from using more computing power to search progressively larger parts of the solutions space, a result similar to the one expected for random search algorithm.

Overhyped and Ambiguous Reporting

Here is an example of overhyped and ambiguous reporting of results from recent work on EA. Researchers Becker and Gottschlich⁴⁰ go from naming their paper “AI Programmer: Autonomously Creating Software Programs Using Genetic Algorithms,” to the abstract, “AI Programmer, that can automatically generate full software programs requiring only minimal human guidance,” to claiming that “Using AI Programmer, we were able to generate numerous complete software programs.” Finally in experimental results they state what they managed to produce: “A generated program that outputs ‘hello’ ” or performs addition operation.”⁴⁰ But even that is a bit of a hype: “Rather than starting with “Hello World”, we first had AI Programmer create a more simplistic program that simply output ‘hi.’ It was successful after 5,700 generations…”⁴⁰

Even this trivial result was not a clean success. “The generated program fulfilled its requirement to output the target text, but interestingly included subsequent random characters, which contained parsing errors, including nonmatching brackets.”⁴⁰ An identical program but the one printing “I love all humans” took 6,057,200 generations.⁴⁰

Brute Forcing Strings

Perhaps it is unfair to pick on this particular paper, which is only available as an unreviewed pre-print, but I selected it because it is highly representative of the type of work frequently published in GP, and its extreme claims make problems clear to identify. If its title was “Brute Forcing Strings” it would be a reasonable work on that subject, but like so many others, the authors claim to be “Autonomously Creating Software Programs” using evolutionary computation, a claim that is never substantiated in any published literature on this subject.

Here are similar results from the 2015 “General Program Synthesis Benchmark Suite” by Helmuth and Spector:

Of the seven problems on which PushGP found no generalizing solution, most are not surprising in that they involve extensive use of multiple programming constructs, the linking of many distinct steps, or a deceptive fitness space where fitness improvements do not lead toward perfect programs. We have written solutions to each of the unsolved problems by hand …³⁸

They go on to conclude:

While some problems can be solved with programs containing fewer than 10 instructions, none would likely be found using brute force search over our instruction sets within the number of program evaluations allowed here. Searching over size 5 programs using the Number IO instruction set would require evaluating over 7 billion programs, much more than the 5 million we used in our GP runs. Other problems have smallest known solutions of over 20 instructions using instruction sets with more than 100 instructions, to our knowledge beyond the reach of all other program synthesis systems.³⁸

Others have expressed similar skepticism:

“We examine what has been achieved in the literature, and find a worrying trend that largely small toy-problems been attempted which require only a few line of code to solve by hand.”³⁷
“A full understanding of open-ended evolutionary dynamics remains elusive.”⁵⁵
“A literature review has revealed that only a small fraction of the papers on GP deal with evolving TE computer programs, with the ability to iterate and utilize memory, while the majority of papers deal with evolving less expressive logical or arithmetic functions.”³⁷ “We conclude that GP in its current form is fundamentally awed, when applied to the space of TE programs.”³⁷ “Computer code is not as robust as genetic code, and therefore poorly suited to the process of evolution, resulting in a insurmountable landscape which cannot be navigated effectively with current syntax based genetic operators. Crossover, has problems being adopted in a computational setting, primarily due to a lack of context of exchanged code. A review of the literature reveals that evolved programs contain at most two nested loops, indicating that a glass ceiling to what can currently be accomplished.”³⁷
“There are many problems that traditional Genetic Programming (GP) cannot solve, due to the theoretical limitations of its paradigm. A Turing machine (TM) is a theoretical abstraction that express the extent of the computational power of algorithms. Any system that is Turing complete is sufficiently powerful to recognize all possible algorithms. GP is not Turing complete.”⁵⁶

“Time to Move On”

Even a survey of the GP community itself produced the following feedback regarding current problems being worked on:

“Far too many papers include results only on simple toy problems which are often worse than meaningless: they can be misleading”;
“([W]e should exclude) irrelevant problems that are at least 20 years old”;
“[G]et rid of some outdated, too easy benchmarks”;
“[T]he standard ‘easy’ Koza set should not be included”;
“[It is] time to move on.”³⁶

In practice GPs are used in the same way as GAs, for optimization of solutions to particular problems or for function optimization^{36, 37, 56-59} or for software improvement.⁶⁰

Tomorrow, “Possible Reasons for Limited Success of Evolutionary Algorithms.”

References:

Koza, J.R., Human-competitive results produced by genetic programming. Genetic Programming and Evolvable Machines, 2010. 11(3-4): p. 251-284.
Kannappan, K., et al., Analyzing a decade of human-competitive (“humie”) winners: What can we learn?, in Genetic Programming Theory and Practice XII. 2015, Springer. p. 149-166.
Clune, J., et al., Natural Selection Fails to Optimize Mutation Rates for Long-Term Adaptation on Rugged Fitness Landscapes. PLoS Comput Biol 2008. 4(9).
Ray, T.S., Evolution and optimization of digital organisms. Scientific Excellence in Supercomputing: The IBM 1990 Contest Prize Papers, 1991.
David, O.E., et al., Genetic algorithms for evolving computer chess programs. IEEE transactions on evolutionary computation, 2014. 18(5): p. 779-789.
Altshuler, E.E. and D.S. Linden, Wire-antenna designs using genetic algorithms. IEEE Antennas and Propagation magazine, 1997. 39(2): p. 33-43.
Bird, J. and P. Layzell, The evolved radio and its implications for modelling the evolution of novel sensors, in Proceedings of the IEEE Congress on Evolutionary Computation (CEC’02). 2002. p. 1836-1841.
Hauptman, A. and M. Sipper, GP-endchess: Using genetic programming to evolve chess endgame players, in European Conference on Genetic Programming. 2005, Springer: Berlin, Heidelberg. p. 120-131.
Kolmogorov, A.N., Three Approaches to the Quantitative Definition of Information. Problems Inform. Transmission, 1965. 1(1): p. 1-7.
Turing, A., Computing Machinery and Intelligence. Mind, 1950. 59(236): p. 433-460.
Yampolskiy, R.V., Turing test as a defining feature of AI-completeness, in Artificial Intelligence, Evolutionary Computing and Metaheuristics. 2013, Springer Berlin Heidelberg. p. 3-17.
Armstrong, S. and K. Sotala, How we’re predicting AI–or failing to, in Beyond Artificial Intelligence. 2015, Springer. p. 11-29.
Soros, L.B. and K.O. Stanley. Identifying necessary conditions for open-ended evolution through the artificial life world of chromaria. in H. Sayama, J. Rieffel, S. Risi, R. Doursat, & H. Lipson (Eds.), Artificial life 14: Proceedings of the Fourteenth International Conference on the Synthesis and Simulation of Living Systems. 2014.
Teller, A. Turing completeness in the language of genetic programming with indexed memory. in Evolutionary Computation, 1994. IEEE World Congress on Computational Intelligence., Proceedings of the First IEEE Conference on. 1994. IEEE.
Woodward, J.R., C.G. Johnson, and A.E. Brownlee. GP vs GI: if you can’t beat them, join them. in Proceedings of the 2016 on Genetic and Evolutionary Computation Conference Companion. 2016. ACM.
Poli, R., et al., Field guide to genetic programming. http://www.gp-field-guide.org.uk/
Woodward, J.R. GA or GP? that is not the question. in Evolutionary Computation, 2003. CEC’03. The 2003 Congress on. 2003. IEEE.
Orlov, M. and M. Sipper, Flight of the FINCH through the Java wilderness. IEEE Transactions on Evolutionary Computation, 2011. 15(2): p. 166-182.