A Simple Approach to Genetic Programming Cost Minimization

Paolo de Dios and Stephen Jan
Columbia University, w4771 Machine Learning
{pd119, sj178}@columbia.edu

Determining the right hypothesis representation is a difficult task. Perhaps the hardest part of deriving a proper hypothesis, especially in genetic programming applications, is determining whether the relevant attributes are properly represented. The accuracy of this hypothesis representation directly affects the computation time and effectiveness of genetic programs.

In this experiment, we attempt to utilize human-derived strategies to represent the attributes necessary to derive a hypothesis representation that can be generated more quickly. The goal of this experiment is to test one method for increasing the learning rate and reducing the computation time of genetic programming applications. The technique is geared towards the game Othello.


Introduction and Motivation
Genetic programming is an effective but costly learning method that optimizes a problem solution by simulating biological evolution. It performs program or hypothesis evolution via a series of crossovers and mutations on a "population" of function trees that represent various configurations of the hypothesis. Candidates for "genetic propagation" are selected through probabilistic, tournament, randomized, or greedy selection algorithms according to some fitness measure.

These genetic algorithms model evolution well: they attempt to probabilistically produce a diverse function population. The problem with these methods is that they are computationally intensive. In order to produce successively optimized hypothesis representations, genetic algorithms must be able to produce fit individuals that meet or exceed a certain watermark for performance. Since these function trees are probabilistically selected, the performance of the resulting function trees can vary from generation to generation. They may or may not meet the fitness criteria for selection, thus protracting the evolution process needed to find a set of fit individuals for the next generation.
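The selection step described above can be sketched as a minimal tournament selection (the TournamentSize setting in the configuration below governs the analogous draw size in gpjpp). This is an illustrative sketch, not the gpjpp implementation: the fitness values are hypothetical, and we assume a standardized fitness where lower is better.

```java
import java.util.Random;

// Illustrative tournament selection: repeatedly draw a small random
// "tournament" from the population and keep the fittest entrant.
// Assumption: standardized fitness, i.e. lower values are better.
public class TournamentSelection {
    static final Random rng = new Random(42);

    // Return the index of the best (lowest-fitness) individual among
    // `tournamentSize` randomly chosen members of `fitness`.
    static int select(double[] fitness, int tournamentSize) {
        int best = rng.nextInt(fitness.length);
        for (int i = 1; i < tournamentSize; i++) {
            int challenger = rng.nextInt(fitness.length);
            if (fitness[challenger] < fitness[best]) best = challenger;
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical standardized fitness values for 5 trees.
        double[] fitness = {113.0, 75.0, 108.0, 105.0, 82.0};
        int winner = select(fitness, 3);
        System.out.println("selected individual " + winner
                + " with fitness " + fitness[winner]);
    }
}
```

Because entrants are drawn with replacement, weaker individuals still propagate occasionally, which preserves the population diversity the text describes, at the cost of extra generations.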

There are a number of ways to approach this problem. An obvious method would be to change the genetic algorithm itself to be less computationally intensive. The difficulty with that approach is that it may require considerable mathematical rigor to find an alternative method that still accurately models biological evolution. Another method is to select or modify a less stringent fitness criterion. Once again, this approach is difficult because it is hard to predict how well a certain function tree will perform at a given fitness level. It may well require a great deal of testing and tweaking to get right.

In the interest of time and CPU resources, we opted for a simpler approach. We decided to optimize the hypothesis representation by examining human learning performance against EDGAR in Othello. The idea is that since humans are faster learners than computers, it would be better to use humans to assist in deriving a more accurate hypothesis representation and thereby get the genetic programming system off to a good start. The hope is that either a smaller or a more accurate hypothesis representation would translate to better performance: a smaller hypothesis space would reduce the time it takes to generate function populations, while a more accurate representation would increase performance against EDGAR.


Experimental Setup
One of the biggest problems with extracting a hypothesis representation from human players is that their strategies vary widely. In order to constrain this experiment, we simply encoded human strategies as new function tree terminals that express the general human strategy in terms of the board representation.
GPOthello and gpjpp SymbReg Settings

Population Size              100
Number of Generations        51
CrossoverProbability         90.0
CreationProbability          20.0
CreationType                 2
MaximumDepthForCreation      6
MaximumDepthForCrossover     17
MaximumComplexity            100
TournamentSize               10
SwapMutationProbability      20.0
ShrinkMutationProbability    0.0
AddBestToNewPopulation       true
SteadyState                  false
DemeticGrouping              true
DemeSize                     50
DemeticMigProbability        100.0
TerminationFitness           126
GoodRuns                     5
UseADFs                      false
TestDiversity                true

The four run configurations tested:

SelectionType = Probabilistic, ComplexityAffectsFitness = false
SelectionType = Tournament,    ComplexityAffectsFitness = false
SelectionType = Tournament,    ComplexityAffectsFitness = true
SelectionType = Probabilistic, ComplexityAffectsFitness = true

Fitness Measure

We needed to represent the performance of the GP relative to EDGAR's performance, which seemed like a more accurate measure. To facilitate this, we modified the fitness measure to be the number of pieces EDGAR held; the GP's fitness was therefore based on how well it minimized EDGAR's score. For the purposes of this experiment, we set the gpjpp "TerminationFitness" variable to 126. Over the course of the default 5-game run, this means that we terminate a run when we find a program that leaves EDGAR with around 25 pieces on average, just enough to narrowly beat EDGAR. This saves the GP system computation time, since it does not have to proceed with calculating the next generation.
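A minimal sketch of this modified measure, assuming EDGAR's per-game piece counts as input (the class and method names below are ours for illustration, not gpjpp's):

```java
// Illustrative sketch of the modified fitness measure: a candidate
// tree's fitness is the total number of pieces EDGAR holds across
// the evaluation games, so lower is better. A run terminates once a
// program scores at or below TerminationFitness = 126, i.e. EDGAR
// is held to about 25 pieces per game over the default 5 games.
public class EdgarFitness {
    static final double TERMINATION_FITNESS = 126.0;

    // Sum EDGAR's piece counts over the evaluation games.
    static double fitness(int[] edgarPiecesPerGame) {
        double total = 0;
        for (int pieces : edgarPiecesPerGame) total += pieces;
        return total;
    }

    static boolean terminates(double fitness) {
        return fitness <= TERMINATION_FITNESS;
    }

    public static void main(String[] args) {
        // Hypothetical 5-game run where EDGAR averages 25 pieces.
        int[] games = {26, 24, 25, 25, 25};
        double f = fitness(games);
        // prints: fitness = 125.0, terminate = true
        System.out.println("fitness = " + f + ", terminate = " + terminates(f));
    }
}
```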
Human Performance Monitoring and Evaluation

We had 3 test subjects play against EDGAR a varying number of times. The test subjects did not know how to play Othello at the outset and were asked to mention when they changed their strategies by adding or removing a feature they were looking for, as denoted by the +'s and -'s in the table. We asked them which features they looked for as a guideline for our implementation.
Test Environment

java version "1.2.2" Classic VM (build JDK-1.2.2-W, native threads, symcjit)


Human Performance
Test Subject 1
Game  Human  EDGAR
  1    30     34
  2    31     33
  3    24     40
  4    15     49
  5    28     36
  6    19     45
  7    31     33
  8    20     44
  9    16     48
 10    18     46
 11    14     50
 12    22     42
 13    28     36
 14    32     32
Test Subject 2
Game  Human  EDGAR
  1    20     44
  2    19     45
  3    23     41
  4    24     40
  5    17     47
  6    31     33
  7    34     30
  8    26     38
  9    14     50
 10    21     43
 11    24     40
 12    18     46
Test Subject 3
Game  Human  EDGAR
  1     9     55
  2    16     48
  3    33     31
  4    34     30
  5    17     47
  6    27     37

From the 3 test subjects we derived that the most important strategy was to control the edges and corners. It should also be mentioned that EDGAR employs this strategy, so it may be that our human subjects were simply learning from EDGAR.

New Terminal Representation

In addition to the near-corner representations, we added:

( white_near_edges )
( black_near_edges )

Near edges include all spaces that border an edge but are not edge spaces themselves, i.e. any space one cell away from the edge. Near corners include all spaces that are adjacent to a corner.
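Under these definitions, the new terminals reduce to simple counts over the 8x8 board. The sketch below is illustrative only: the cell encoding (0 empty, 1 white, 2 black) and the helper names are our assumptions, not GPOthello's actual internal representation.

```java
// Sketch of the near-edge / near-corner terminal counts on an 8x8
// Othello board. Cell encoding (0 = empty, 1 = white, 2 = black) is
// an assumption for illustration, not GPOthello's internal format.
public class BoardTerminals {
    static final int EMPTY = 0, WHITE = 1, BLACK = 2;

    // A near-edge square sits in the second ring: one cell in from
    // the border, but not on the border itself.
    static boolean isNearEdge(int r, int c) {
        boolean onEdge = r == 0 || r == 7 || c == 0 || c == 7;
        boolean secondRing = r == 1 || r == 6 || c == 1 || c == 6;
        return secondRing && !onEdge;
    }

    // A near-corner square is adjacent (incl. diagonally) to one of
    // the four corners, but is not a corner itself.
    static boolean isNearCorner(int r, int c) {
        int[][] corners = {{0, 0}, {0, 7}, {7, 0}, {7, 7}};
        for (int[] k : corners) {
            if (r == k[0] && c == k[1]) return false;  // a corner itself
            if (Math.abs(r - k[0]) <= 1 && Math.abs(c - k[1]) <= 1) return true;
        }
        return false;
    }

    // Count `color` stones on near-edge squares (nearEdge == true)
    // or near-corner squares (nearEdge == false).
    static int count(int[][] board, int color, boolean nearEdge) {
        int n = 0;
        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++)
                if (board[r][c] == color
                        && (nearEdge ? isNearEdge(r, c) : isNearCorner(r, c)))
                    n++;
        return n;
    }

    public static void main(String[] args) {
        int[][] board = new int[8][8];
        board[1][1] = WHITE;  // near-edge AND near-corner
        board[1][4] = WHITE;  // near-edge only
        board[0][1] = BLACK;  // edge square beside a corner: near-corner only
        System.out.println("white_near_edges   = " + count(board, WHITE, true));  // 2
        System.out.println("white_near_corners = " + count(board, WHITE, false)); // 1
        System.out.println("black_near_corners = " + count(board, BLACK, false)); // 1
    }
}
```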

Unmodified Representation
Candidates below the fitness threshold defeated EDGAR. Given enough training, the current hypothesis representation can defeat EDGAR. The time it takes to produce such a GP is quite long, though, taking as much as 310 seconds to generate a single generation of hypotheses. These results form the baseline performance benchmarks that we wanted to improve upon.
Trial 1 - Generation 1: best 44: worst 48
GP# dad fitness len dep
44 44 105.0 19 6
RPB: ( + ( / ( - white_near_corners black ) black ) ( / ( - black_corners white_edges ) ( * ( * black ( * white white_near_corners )) ( - black black_edges ))))

Run time 616.75 seconds, 307.20 seconds/generation
BLACK wins. The final score is WHITE 26 BLACK 38.
Trial 2
GP# oper mut fitness len dep
11 RPB:3 108.0 13 5
RPB: ( / white ( - ( * black_edges ( + white black_corners )) ( - ( + black_edges black_edges ) white_edges )))

Run time 1822.47 seconds, 305.28 seconds/generation
BLACK wins. The final score is WHITE 29 BLACK 35.

Reduced Terminal Set Representation
The terminal set consisted of:

( white_near_edges )
( black_near_edges )
( white_near_corners )
( black_near_corners )

Candidates below the fitness threshold defeated EDGAR. It took much less time to generate a GP that can beat EDGAR, and as opposed to the original representation, the reduced terminal set beats EDGAR by a much larger margin. This cost savings can be attributed to the fact that the reduced hypothesis space requires much less computation to mutate and generate successor generations. Compared to the original hypothesis representation, the use of strategy-oriented GP terminals resulted in a run-time reduction in excess of 66% and a game-performance increase of about 270%.
            Average Run Time Margin    Average Game Performance Margin
            (lower is better)          (higher is better)
Original    306.0                      9
Modified    103.4                      25
Improvement 66.4% run time reduction   270% performance improvement
Selected Runs (Best Performers)
Trial 1
GP# dad fitness len dep
0   75.0 7 3
RPB: ( / ( - white_near_edges black_near_edges ) ( * black_near_edges white_near_corners ))

Run time 3616.17 seconds, 68.24 seconds/generation
BLACK wins. The final score is WHITE 15 BLACK 49.
Trial 2
GP# dad fitness len dep
0 75.0 9 4
RPB: ( - ( - ( + white_near_edges black_near_corners ) ( + white_near_corners black_near_edges )) white_near_corners )

Run time 7145.98 seconds, 137.45 seconds/generation
BLACK wins. The final score is WHITE 19 BLACK 45.
Trial 3
GP# dad fitness len dep
75 75 113.0 3 2
RPB: ( - white_near_edges black_near_edges )

Run time 208.77 seconds, 102.80 seconds/generation
BLACK wins. The final score is WHITE 22 BLACK 42.
Trial 4
GP# dad fitness len dep
79 70:RPB:1 115.0 5 3
RPB: ( - ( * white_near_edges white_near_edges ) black_near_edges )

Run time 211.22 seconds, 105.38 seconds/generation
BLACK wins. The final score is WHITE 22 BLACK 42.

Increased Terminal Set Representation
Candidates below the fitness threshold all lost to EDGAR despite having the best fitness scores. This demonstrates that an accurate and terse hypothesis representation is necessary to produce effective GPs. The addition of terminals probably just increased the "noise" in the GP and inserted false criteria into the fitness evaluation. Although the expanded hypothesis representation resulted in faster generation of fit GPs, this did not translate directly into good performance against EDGAR. Perhaps an alternate fitness measure is in order here to account for the increased number of function tree terminals.
            Run Time Margin            Game Performance Margin
            (lower is better)          (higher is better)
Original    306.0 sec                  9
Modified    214.4 sec                  -24
Improvement 30% run time reduction     -266% performance degradation
Selected Runs (Best Performers)
Trial 1
GP# dad fitness len dep
2 27 40 15 4
RPB: ( / ( + ( - black white ) ( - white_near_edges white_edges )) ( + ( - white_corners black_corners ) ( / white_near_edges 10 )))

Run time 425.72 seconds, 220.70 seconds/generation
WHITE wins. The final score is WHITE 41 BLACK 23.
Trial 2
GP# dad fitness len dep
36 64:RPB:7 70.0 9 5
RPB: ( + white_near_edges ( + white_near_corners ( * black_corners ( + white_near_edges white ))))

Run time 424.23 seconds, 216.58 seconds/generation
WHITE wins. The final score is WHITE 50 BLACK 14.
Trial 3
GP# dad fitness len dep
61 59:RPB:20 75.0 35 8
RPB: ( + ( - ( * ( / white_edges black_corners ) ( * black white_edges )) ( * ( / white_edges black ) ( * white_near_corners black ))) ( - black_edges ( - ( - ( + ( * white_near_corners 10 ) ( / ( * black white ) ( * white_edges black_corners ))) white_near_edges ) ( + black_near_edges white_near_edges ))))

Run time 679.52 seconds, 231.64 seconds/generation
WHITE wins. The final score is WHITE 43 BLACK 21.
Trial 4
GP# dad fitness len dep
93 86:RPB:2 82.0 7 4
RPB: ( + ( * ( - white_near_edges white_edges ) 10 ) black_corners )

Run time 1378.02 seconds, 189.72 seconds/generation
WHITE wins. The final score is WHITE 44 BLACK 20.


As one can see, there is no obvious trend of improved performance for a human agent; a person's performance fluctuates and seems extremely erratic. We attribute this to the following possible factors:

1) the human player loses general concentration
2) the human player focuses on one feature and forgets to focus on another
3) the human player loses morale (e.g. if EDGAR takes a corner)
4) the human player modifies his strategy as he plays

We also believe that whenever the hypothesis is changing, performance inherently suffers in the short run but improves in the long run. Test subject 1 illustrates this point: the last 4 games showed constant improvement, where previous games had more sporadic improvement streaks.

Our approach was fairly effective. We reduced the tree representation in accordance with what humans believed to be the important features. The features we selected were the near-edge and near-corner spaces, nothing else. Using only these 4 features, we managed to derive a GP agent that could beat EDGAR in roughly a third of the computation time (a 66% reduction).

We also tried adding the new terminals without removing the default terminals. The runtime performance was better, but the system was not able to generate an agent that could beat EDGAR. We conclude that the more complex terminal representation added more noise to the function tree representation. It can be concluded that a winning strategy is not directly related to the number of features; rather, the selection of these features is non-arbitrary. One effective means of approaching feature specification seems to be data mining on humans or expert systems.

As mentioned above, the features used to train the GPOthello player were derived from human subjects. These human test subjects were trained against EDGAR, so their distilled strategy, as encapsulated by the four terminals, is specialized to beat EDGAR. Unlike EDGAR's, these strategies do not generalize to play against human opponents. This can be verified by actually playing the GP agent: initial test results indicate that every human who has played it can defeat it quite easily.



Play against the trained GP that spanked EDGAR <here>
Zip archive of our results <here>