Wednesday, December 24, 2008

Catching up - SFABC

First order of business - catching all of my faithful readers up on my life.

Science Fiction Association of Bergen County - a great group of folks I met in July at a Barnes and Noble in Paramus. Thank you, BN for hosting that meeting! I'm now a regular member and a member of their writers group. I'm very happy to be writing fiction again!

Tuesday, December 23, 2008

New job, new post

Hey, it's been a while.

My only excuse for my long absence is that I've been in process of changing jobs for most of 2008. I've left PricewaterhouseCoopers and joined Deloitte and Touche.

So I hope to be sharing more with you, and more consistently, into the future.

Thursday, February 21, 2008

Topping the charts


Long time no blog, sorry!

So what have I been doing? Well, what I'd like to share with you is some hacking of the charting facility in ECJ.

ECJ's distribution contains some references to the inclusion of JFreeChart and iText, which will enable the production of "publication quality charts". OK, I'm all for it, but then come the caveats. "We don't use the Console app much." "We don't really use that."

So what we've got is the linkage to a good free package for charting, but only the sketchiest pointers on how to integrate it into your ECJ experiments.

There were two things about "publication quality" charts that I wanted to do right away. First, charts in journal articles often show multiple data series on the same chart. Second, data published in journal articles is usually averaged over a significant number of GA runs. ECJ doesn't contain an example of either of these things out of the box.

I've hacked together an example of the first functionality. Now I can chart the best of generation, worst, mean, and standard deviation all on the same chart.

I ran my BinInt experiment with the new chart to show that it worked. Wow! So much nicer than looking at the simple statistics!

One thing that jumps out immediately is that the experiment is wasting more than half its function evaluations because the population has converged. While we can guess this is happening when we see the best of generation fitness stall at a particular value, charting the worst, mean and standard deviation makes it completely obvious that before generation 45 (of 100) the entire population is a single genotype.

(Remember, this is a selectorecombinative GA experiment, so there is no mutation to rescue the population. We expect the population to converge well short of the optimum due to domino convergence and drift, and it does.)

Function evaluations are the currency of evolutionary algorithms. While many systems (including ECJ) let you stop a run if a known optimum is reached, stopping on a convergence criterion seems to be just as useful. I'm not sure why this is not more common, since the bookkeeping necessary is not that great an overhead. In general, you can trust the computational cost of the function evaluations to swamp the cost of the algorithm itself.

Now I'll start tinkering with the second issue, averaging over several runs. I'm going to hack this using the subpopulation mechanism of ECJ. Each subpopulation will be an independent run with a different random seed. However, the statistics gathering (and charting) mechanism can look across the values of each subpopulation and collect the statistics and average them. Is this the right way to do this? I'm not sure, but I'll try it!

Tuesday, February 12, 2008

The Illusion of Life

The Illusion of Life is actually the name of a beautiful book about the classic Disney style of animation and how it was acheived.

I'm going to use the phrase as a jumping off point to talk about GAs. Genetic Algorithms are explicitly designed to imitate the successful process of optimization seen all around us.

Simple GAs try to tear down the bells and wistles of real biology and use just the core ideas of evolution. Take a finite population, evaluate the fitness of each member. Preferentially select higher fitness members to reproduce or be copied forward. Allow some source of variation to create new population members during reproduction. The most common reproductive operators are mutation and crossover.

A choice of population size, selection algorithm, operators and representation that all work together will take even a simple GA a long way. But GAs are sometimes criticised as being unable (even in principle) to replicate the diversity we see in the natural world.

Most GAs working on an optimization problem are like a microbe evolving drug resistance in a test tube environment. No one is even attempting to recreate the variety of species and niches of real life.

Here are some things that are sometimes brought up as examples of how GAs can be made more "biological", in the belief that this will somehow make them better:
  • haploid genomes
  • inversion operator
  • introns

Now it may be that there are specific optimization problems where each of these tricks might be helpful, but none of them has been shown to be extremely useful.

Notice that these ideas focus on the nitty-gritty of DNA. I have my own list of things I think might be important to bring in to our GA models from biology, but these aren't on my list. Instead, here is my list:
  • embodiment - in time and space
  • coevolution - a fitness measure in which the other members of the population are explicitly considered, including the possibility of parasitism
  • development - genetic cascades, and some form of lifetime fitness measure
  • self regulation - in the form of hormones or neurons
  • sexual selection

I'm much more interested in the possibilities of a boost from ecology than biology. It may be that one day, very powerful GAs will drop some things that are now considered "essential" such as the population itself! Estimation of distribution algorithms are a step in that direction. Until then, I think we do have a lot to learn from Mother Nature, but I think we need to look in some different directions.

Monday, February 11, 2008

Sex, Lies, and Genetic Algorithms

Moving on from the BinInt problem, the next basic research problem to code in ECJ is the classic "deceptive" problem. In this kind of problem, the optimum point is surrounded by very low fitness points, while the points furthest from the global optimum have the second highest fitness. A hill climbing algorithm is tempted to follow the slope away from the global best towards the suboptimal points.
However, a selectorecombinative GA is not a hill climber. Its ability to find the optimum point is based on shuffling combinations of alleles, trying to find good sets.

If the deceptive subproblem is too big, the GA has little better than random chance of finding the optimum, so here also we will rely on building up a problem by concatenating several subproblems.

Based on calculations shown in The Design of Innovation, p. 160, I verified that my code for the trap function was in fact deceptive. Then I looked at a few other sources, including Franz Rothlauf's Representations for Genetic and Evolutionary Algorithms, and I saw that my code for the trap was basically the same as his.

Tuesday, February 5, 2008

Testing my BinInt implementation

As a sanity check, I scaled the size of my subproblems down to 1, and compared the performance of the BinInt and MaxOnes problems with identical parameters. Ta dah! Exactly the same results, as you would expect.

Now I'm dialing up the BinInt size gradually. The effect of scaling is obvious, as many bits converge too early, carried along on the coattails of the high value bits. This series of experiments is with a 0.0 mutation rate, to test the powers of a selectorecombinative GA in isolation. In a problem l bits in size, with population and number of generations also held at l for l^2 function evaluations in total, the best fitness over a run is not even reaching half the known optimum.

I'm also still learning lot's of basic things about ECJ, such as how to chart statistics using the gui console. Woot!

Thursday, January 31, 2008

ECJ - building a BinInt problem

I've gotten a little farther with my learning of ECJ.

The tutorial 1 provided with ECJ uses the MaxOnes as a sample problem. MaxOnes simply adds up the number of "1" bits in the genome of an individual to create a fitness for that individual.

BinInt changes the problem by reinterpreting the genome as a binary integer. In my version, I've written it so that you can cut the genome up into several integers. For example, a genome of 150 bits can be interpreted as 30 5-bit integers, 5 30-bit integers, or even 16 9-bit integers with one 6-bit integer.

BinInt is a class of problems that imposes two new kinds of difficulty on the GA. First, BinInt shows scaling. Different bits are worth exponentially more than other bits. Secondly, bits are related to each other to form sub-problems. All the bits of a particular sub-problem contribute to the solution of that part of the overall problem.

Scaling in particular challenges a GA. The large fitness contribution from just a few bits can cause individuals lucky enough to have those bits turned on early in the run to quickly dominate the population, even before the population has found the best arrangement for the other bits. This problem is called premature convergence.

I had fun building the BinInt problem in ECJ. It was straightforward to modify the MaxOnes problem fitness evaluation function. The learning came in when I wanted to create a custom parameter to control the sizes of the integers. I invoked the OTSOG (On The Shoulders Of Giants) principle, and asked for help from the ECJ discussion list, with very helpful and immediate answers. Thanks guys!

Thursday, January 24, 2008

Sketch of comics hero team

Oh yeah, I like comics and manga, too. Sorry I forgot to tell you earlier.

I visited an Oscar site yesterday, and was happy to see that Cate Blanchett was nominated for Elizabeth: The Golden Age. I feel really stupid for not realizing that this movie is a sequel.

Anyway, it got me thinking about a hero team set in the Elizabethean Age.

In the backstory, Elizabeth really is the Faerie Queen. The last test of a new Queen of Faerie is to seduce a mortal lover, but Elizabeth instead falls for the mortal and renounces her throne in Faerie. This choice angers a large number of powerful spirits, who were planning on using Elizabeth's control of two thrones (Faerie and England) to transform England into the New Atlantis. (Not the New Atlantis of Bacon, sorry.)

Our heroes:
Gloriana - Queen Elizabeth in disguise. Or is Elizabeth the disguise, and Gloriana her real identity?
Sir Francis Walsingham - the spymaster of Elizabeth in real life. Here, a bit more "hands on".
Dr. John Dee - a link between the scientific and the supernatural. Capable of a bit of wizardry with the help of the right artifacts.
The Golem of Prague - on loan to Dee from the Maharal of Prague, who made him. Think Ben Grimm with a Yiddish accent.
Thomas Phelippes - asistant of Walsingham. Inventor, cryptographer, etc. The Q to Walsingham's M. Deciphered the Book of Soyga for Dr Dee.

Our villains:
Phillip II of Spain - wants to conquer England and create his own version of the New Atlantis.
Duetia - beautiful faery, would be Queen of Faery if she could eliminate Gloriana.
John Kelley/Talbot - worked with Dr Dee but has fallen under dark influences.
Quetztlcoatl - the Aztec god, making a deal to stay alive in the New World's New World Order.
Talus - a mechanical man/animated suit of armor. Sometimes comic relief for being too literal.

I know you're asking already, who would win a Golem vs. Talus smackdown. Is the occult science of Dr Dee any match for the pure pagan fury of a God? Can Walsingham outwit Phillip? Can Gloriana save England from drowning beneath the waves if Duetia has her way? And what about Naomi?

Find out in our next issue!

Tuesday, January 22, 2008

ECJ - tutorial 1

I set up the Evolutionary Computation in Java (ECJ) system to run under Eclipse. Here is a brief report on using it to run the first tutorial provided with the system. This is the same tutorial covered in the YouTube videos mentioned previously on this blog.



The tutorial description assumes that you are typing in every line of the tutorial1.params file, but since it is provided I went straight to tweaking the different parameters.



The optimization problem in this tutorial is trivial to state. The genome is a set of unrelated bits which are counted, so maximum fitness occurs when all bits are 1, hence the name MaxOnes. As trivial as that sounds, a lot of basic research in GAs is conducted on this and similarly simple functions.



Out of the box, the parameters of interest are


  • crossover probability = 1.0 (we will always cross to get a new individual)

  • mutation probability = 0.01 (every bit has a 1% chance of changing)

  • genome size = 40 bits

  • population size = 40

  • generations = 100



With this setup, we get





Generation 32
Found Ideal Individual


It took 32*40=1280 function evaluations to find the ideal individual of all ones.


Upping the ante, if we double the genome size to 80


Generation 87
Found Ideal Individual

We found it, but doubling the size of the problem more than doubled the effort to solve it. It almost tripled it. That is disturbing.


Doubling once again to a genome of 160 bits, and the system cannot find the ideal individual, even in 1000 generations. Even if the population size is quadrupled to 160, the system fails.


What has happened? Have we reached the edge of evolution already? Is this all that GAs are capable of, solving trivial problems?


Happily, the answer is no. We are very far from the edge of evolution (if there is such a thing), but we are at the beginning of understanding how GAs work.


One of the great themes of search and optimization algorithms is the tradeoff between exploration and exploitation. When do we look in new areas for an optimum, and when do we try to find the best value in the local area of a solution we already know is "pretty good"?


In GA work, the operators of crossover and mutation are often characterized as typifying exploration and exploitation, respectively. I think this is an acceptable characterization wrt mutation, though it isn't the greatest way to think about genetic operators. Another way to think about mutation is to look at it as a kind of heat source, as in simulated annealing.

What happens if we scale back the mutation rate to 0.001 and go back to our original population size of 40?


Generation 219
Found Ideal Individual

Very interesting! What was happening in our earlier runs? The mutation rate was 1 out of 100. When the genome size was less than 100 bits, individuals could get crossed and then never suffer mutation on their way into the new population. But when the genome size went to 160, we were almost guaranteeing that every individual would suffer, on average, 1.6 bits changed. So the population was constantly dancing around the optimum, but never hitting it exactly.

If there is a final lesson to be drawn from this small example, it is that GAs are not about parameter tweaking. There is always a coupling between the probem and the parameters that needs to be understood.

Chipping Away at the Wierdness

Over at the Mindful Hack, Denyse O'Leary defends the study of prayer thusly

Anyone familiar with the placebo effect would consider $2.3 million to
study the power of prayer money well spent.

How should we understand this? Does Denyse really think God is nothing more than the placebo effect? If intercessionary prayer "works", there should be some measurable difference between the placebo effect and people who are prayed for.

I think Denyse would like to help people believe in the objective reality of God. She is, after all, not a journalist but an apologist. This kind of post does not advance that agenda.

However, it is interesting when looked at in the context of many of Denyse's posts on Mindful Hack, Uncommon Descent, and her Design of Life blog. In posts where Denyse tries to channel Carl Zimmer to actually write about science, she winds up writing sentences that carry assumptions such as the reality of deep time, an old earth, and common descent. Is she with Michael Behe on this, and against most of her crank readership?

Also more out of the closet recently is DaveScot, Uncommon Descent's resident dominatrix. Dave fessed up to the same trio in a few posts over there recently, but has saved his crank street credibility by backing some form of front loaded evolution.

Are we seeing some of the loudest cheerleaders of ID talking themselves down? One can only hope.

A Model of Error

Over at UD, math (sic) professor Granville Sewell resurrects one of the silliest posts of 2006, Gil Dodgen's misapprehension of what makes a model. No, the errors in our OS are not part of the model, any more than they are part of your PDE solver or Dr Dembski's MESA or Dr Marks' EIL work.

It is however true that if you are running intense numerical calculations, you need to take into account the quality of your (pseudo)random number generator, your floating point hardware's aptitude for error due to heat, cosmic rays, bad programming by Intel, etc. Taking these things into account is very different from intending them to be part of the model system. No one is going to be impressed by results that need to be run on Windows ME with a lump of uranium lying on the desk next to the CPU.

But it is from this support the Dr Sewell makes a sweeping statement, "Unintelligent forces simply can’t do intelligent things..." Sorry, Dr Sewell. In the words of the prophet, pwned! May I be the first to introduce you to the Humies.

If you're not following the link, I'll just summarize that the Humies are cash based awards for achievement equal or better than human at some intellectual task, by an EA based system. That we can give away awards like this is a testament to the power that GA, GP, etc have already achieved.

Sunday, January 20, 2008

Paprika

I'm not a complete otaku, which is probably why it has taken me until now to notice this movie.

If you appreciated the way The Matrix played with "is this reality?" than you will love++ this anime. The mix of traditional cel animation and cgi was well done and the music was excellent. The playfullness of the OP (opening credits) was amazing and set the tone for the whole movie. Buy it on DVD!

Can't wait for Spring!

My recent look into science standards here in New Jersey popped up the factoid that the Garden Sate has a state dinosaur (as do several other states). Now I've found that the best place to look for fossils here is Poricy Park. Thank you NJGS!

Since you have to wade out into the stream to seive for fossils along the sandbanks, I'll have to wait for the spring before dragging the family out for the fun. This is gonna be great!

Friday, January 18, 2008

Looking back into ANAS

ANAS is John Holland's 1975 monograph, Adaptation in Natural and Artificial Systems. I went back to it recently to review the history of the thinking about linkage learning from the very beginning.

I was struck that Holland's first presentation of a simple genetic algorithm, called a reproductive plan in ANAS, is a steady state GA. In steady state GA, a single individual is replaced at a time in the population. This means that new mixes of genetic material is made available immediately for selection.

I had mis-remembered this. I thought that historically GAs started as generational models in which the entire population was replaced at once, and that steady state GAs were due to Grefenstette.

Similarly, ANAS considers the possibility of evolving algorithms, not just parameter sets. This is the basic distinction between genetic algorithms and genetic programming, which most people attribute to John Koza.

There is a phrase in Pirkei Avot (Ethics of the Fathers) "hufach bo v'hufach bo d'kulla bo." Turn it over, turn it over, because everything is in it. I think it applies just as well to basic texts in the sciences as it does to the Talmud.

Thursday, January 17, 2008

NJ ASK - WTF?

So I was watching (via PT 'natch) the slow motion train wreck of Florida school districts signing up to stupidity, consigning a generation of graduates to burger flipping and telemarketing hell.


(Btw, if they ever figure out how to outsource burger flipping to India, we are toast. That is, until they figure out how to outsource toast to India...)


So there I was, all smug in my insufferable suburban Yuppie way, when it suddenly occurred to me that the fair state of New Jersey has its own share of rural ignorance, people from one end of the state who impose blue laws on the other end, teachers in Kearny that lecture kids on their private religious beliefs, etc. Perhaps I should check out how my own state is doing?


The good news - New Jersey has some (IMHO) appropriate core content standards in place for life science education in evolution. New Jersey got an A while Kansas was getting an F- (pwned!) a couple years ago when KS was devolving into BS. The standards went into effect in 1996 and are reviewed every five years. Testing shows that even economically disadvantaged kids are scoring above 50% "proficient" at all grade levels.


The better news - we have a state dinosaur! (OK, OK even Florida has a state fossil.)

Back to smug complacency until 2011, when our standards come up for review...

Wednesday, January 16, 2008

Why Now?

In the interests of full disclosure, let me admit that I have been banned at Uncommon Descent. It is something of a badge of honor...

So one of the reasons for this blog is to provide a platform for comment and Kremlinology on the denizens of UD and the discussions there.

This time around, let us wonder at the recent post by Dr. Dembski himself. Yes, what is the predictive power of ID?

cue crickets chirping...

Well, that wraps up our program for today!

Now, for the Kremlinology portion of our show - what is the subtext of this message?

First, it is a way to say "I am important". Normally, important people don't have to broadcast that fact, so this is really telling us that Dr. Dembski's importance is falling rather than rising.

Second, it tells us that someone in the media is actually taking Darwin Day seriously. There is no other reason that anyone would take Dr. Dembski seriously enough to put him on TV (other than with Steve Colbert). Dr. Dembski is the "counterpoint" position for a Darwin program. Well, for a Fox program he might be the "point" position...

Third, it says Dr. Dembski's estimation of his chances of actually getting on the program are low. After all, actually appearing on the show is much more impressive than telling your friends that you might be invited on the show. But if this is as good as it is going to get, he's got to go with it.

Fourth, what is it with WaD and premature ejaculation? The guy cannot keep his mouth shut until the real results come in, he has to start blabbing at the very beginning. If you want other examples, look at the Polya Center and EIL fiascos. The guy knows how to kill his own prospects with ill timed PR better than anyone I know.

Tuesday, January 15, 2008

What Part of Evolution Don't You Undertand?

We're coming up on the Darwin Bicentennial, hooray! In honor of the occasion, I'll throw out my own little explanation of evolution for the sake of anyone dazed and confused by the massive amount of misinformation wandering the Internet.



Evolution is the name of a process. It is what happens when the following are all true for some period of time.
  • There is a population of individuals, and these individuals have traits.

  • The traits influence the success of the individuals in reproducing.

  • There are limited resources available for reproduction.

  • Not everyone that wants to reproduce will get the chance.

  • Chances depend on traits.

  • The traits can vary when individuals reproduce.

If we compare the population after some long period with the initial population, the frequency of the traits will have changed. Traits that helped reproduction will have increased their frequency at the expense of alternatives.


There is nothing in this that limits evolution to biology. The whole point of a field like Genetic Algorithms is that I can set up those conditions in software and evolve data. This is a very successful part of computer science, with many commercial applications.


Of course, it does happen that evolution applies to the biology that we see around us. If you don't think it does, you'll have to point out which of the above conditions doesn't hold in the real world.


Getting back into GAs

I don't remember how I discovered genetic algorithms. At some point, I got a copy of David Goldberg's book, Genetic Algorithms in Search, Optimization, and Machine Learning. It was a great book, and I had a lot of fun writing my own GA package in C (the code in the book was written in Pascal).

Goldberg's next book, The Design of Innovation is even better. The discussion of how to make GAs a principled engineering approach to solving problems, rather than a black box with too many knobs and levers, really energized my interest.

So now I'm learning the ECJ system using Eclipse. With the help of some nice YouTube tutorials on setting up ECJ with Eclipse, I'm off to the races!

Hello World

Don't expect much from this blog. I'm not huge on reguritating my life into an open diary, and I'm not the author of bons mots every day.



I'm going to try to focus the blog on the intellectual areas I care most about. Right now, this means genetic algorithms, cryptography, defending evolution from misguided attacks, XBRL, and a very small dab of politics. Arts and letters will get a passing mention. I'm interested in comics (and manga, and anime), backing into the subect from the perspective of UI design, Tufte, Scott McCloud and Takahashi Rumiko.

Thanks for reading! I hope you'll be back.