Othello Writeup: Blaine Boman, Jenny Weisenberg

CS4771: Machine Learning 04-04-2000

GENETICALLY PROGRAMMED OTHELLO: A Terminal Set Spectrum

Blaine Boman [ bgb10 ]
Jenny Weisenberg [ jw282 ]

Introduction

For this project, we were given a Java implementation of Othello, including a package which allows for players to be evolved using genetic programming techniques. We also had available to us a computer player named EDGAR, and a Random Player which selects random moves. Each evolved player is represented by a tree consisting of primitives: binary arithmetic operators (+, -, *, /) and terminals. The terminals describe positions on the Othello board taken by a certain color, such as "black_corners" and "white_edges".

In our project, we:

explore the effect of increasingly large sets of terminals on evolved player fitness and performance
explore the relationship between fitness and performance against both EDGAR and the Random Player
consider the possibility that evolved players with very low ("better") fitness against EDGAR might be overlearned, and thus might be more likely to underperform against the Random Player
introduce a few of our All-Star players that perform well against both EDGAR and the Random Player

Motivation

It is clear that the choice of terminals is critical in being able to evolve a decent player. We were interested in doing some experiments to determine whether giving the GA a larger terminal set would necessarily make it perform better. Would increasing the terminal set size always make the GA run the same or better, knowing that the GA could either choose to take advantage of its new descriptive powers or know enough to ignore them if they did not help?
On the road to exploring the question above, training against EDGAR, we noticed that some of the evolved players with excellent fitness ratings against EDGAR did not do well against a Random Player, while some of the evolved players with more mediocre fitness ratings beat the Random Player every time. So, we expanded our project to include an exploration of the relationship between overlearning and fitness levels of evolved players. (low fitness could mean an overlearned player.)

Approach

We created five versions, V1-V5. Each version indicates a different set of terminals used. V(X+1) includes all of the terminals in VX, plus two more terminals, as indicated. The terminals included in V1-V4 are identical to those in the original GPOthello code. We added the terminals "white_near_near_corners" and "black_near_near_corners", defined to be the indicated positions in the image below:

Here is a table of the terminal sets used for each version:

VERSION	TERMINALS	CODE MODIFICATIONS
V1	black, white	Reduced original code
V2	black, white white_edges, black_edges	Reduced original code
V3	black, white white_edges, black_edges white_corners, black_corners	Reduced original code
V4	black, white white_edges, black_edges white_corners, black_corners white_near_corners, black_near_corners	Original code
V5	black, white white_edges, black_edges white_corners, black_corners white_near_corners, black_near_corners white_near_near_corners, black_near_near_corners	Expanded original code

For each version, we did two runs, one of which had a high Termination Fitness (75.0) and one of which had a low Termination Fitness (21.0). (The 21 was included because the fitness ratings tend to be in multiples of 10, so we wanted to include the 20.0s) The former was designed to produce players with relatively bad fitness ratings against EDGAR (while still beating him). The latter was designed to produce players with low (good) fitness ratings against EDGAR. We wonder whether or not a player with a low (good) fitness rating would perform better against random players as well or if it do worse because it is overfit to Edgar's particular playing style.
We kept most of the initialization values constant, most importantly, we kept a population of 200, and 10 good runs. We decided to turn the ComplexityAffectsFitness setting to FALSE, thereby avoiding extra variables which could cause confusion. This is especially important because having ComplexityAffectsFitness set to true might help avert overlearning, throwing out complicated trees. We wanted to explore the NATURAL results, so we allow for no automatic pruning of this overgrown shrubbery.

Results

We produced roughly 20-25 Othello players for each version of our experiment. (20-25 players for each terminal set.) We then evaluated their performance against EDGAR and against random players.

Here is a table of our result averages, which illustrates the effect of increasingly large terminal sets. Click here to see more details.

VERSION	AVERAGE FITNESS AGAINST EDGAR (lower rating is "better" fitness)	AVERAGE WINS AGAINST RANDOM PLAYER (out of 100 games)	AVERAGE NUMBER OF POINTS BY WHICH IT BEATS EDGAR
V1	141.75	68.30	6.65
V2	75.63	64.50	28.94
V3	76.25	65.50	30.85
V4	38.79	78.62	36.59
V5	35.24	62.67	39.57

As might be expected, the fitness rating (against EDGAR) decreases ("fitness gets better") as the terminal set gets larger. A larger terminal set is like a bigger vocabulary, enabling the GA to be more descriptive and develop a better player. Likewise, as the size of the terminal set increases, the evolved players, on the average, beat EDGAR by more points. One would have hoped that performance against the Random Player might improve with a larger terminal set, too, even though training was done against EDGAR. Our results did not support this, however, as you can see from the table, there was no such trend. We might attribute this to the fact that a larger terminal set could promote overlearning, producing a very specialized player which can beat EDGAR but which can't generalize to beat a Random Player.
The two largest decreases (improvments) in average fitness occurred between Versions 1 and 2 and Versions 3 and 4. Since version one had only the "white" and "black" terminals, it was not able to encode very much information about the current board state. So it was not suprising that V1 could not perform very well. With the addition of the "white_edges" and "black_edges" terminals in V2, some information regarding the general placement of pieces was able to be taken into account, thus drastically increasing the fitnesses of produced trees. The terminals added between Versions 3 and 4 were "white_near_corners" and "black_near_corners". One might expect the "corners" terminal (added in Version 3) to result in the highest fitness increase, as capturing the corners is surely the key to winning. After playing several games of Othello, however, one realizes that forcing your opponent to move in one of the "near_corners" positions is as important if not more important than capturing the corners themselves.

Here are some graphs of results.

Fitness vs. Random Wins

Disappointingly, the results above are pretty scattered, with no particularly interesting patterns. A naive thought might be to expect the dots to form sort of a diagonal scattering from the upper left hand corner to the lower right hand corner of each graph, indicating that lower (better) fitness performs better. But again, these players were trained against EDGAR, so a very feasible explanation for why that did not happen is because the players with very low fitness were overfitted to EDGAR's strategy, and thus are not generalizable.
The graph of Version 4 is interesting in that it seems like the players either did extremely well or extrememly poorly against the Random Player.

Fitness vs. Beats Edgar By

As expected, all of the graphs above show some sort of linear relationship between the fitness rating and the performance against EDGAR, with lower (better) fitnesses performing well against EDGAR. This is not surprising, as the players were trained against EDGAR. (Beats Edgar By is simply the number of points by which our evolved player beats EDGAR in a showdown. Negative values obviously mean that our player lost to EDGAR.)

Conclusions

We are heartened by the fact that the GA is in fact able to take advantage of increasingly larger terminal sets, and that giving it the power to be increasingly descriptive does produce increasingly skilled players. We have also learned that increasing the descriptive powers of the trees does not necessarily produce players which perform well against random players, as they might tend to be overfitted. This problem might be solved, however, by training our players against a fairly good non-deterministic player -- this might be more likely to result in more players which consistently beat both EDGAR and the Random Player.

Our study has made us curious about the questions of How helpful is each terminal? and Which are the most important terminals? We delved into that a bit while looking at the jump sizes between fitness ratings and performances between the two versions. It would also be interesting to study terminal sets of the same size to better answer this question. For example, it would be a good experiment to test out two terminal sets, both including white/black and white/black_edges, one including white/black_corners, the other including white/black_near_corners. This would be a more direct way to test the effectiveness/value of a specific terminal.

Meet Some Especially Notable Players

As suspected, all three of our All-Star players came from versions 4 and 5 and had "mediocre" fitness values. (Our data did include many players with fitness ratings of 0.0, but these did not perform as well against the Random Player, and thus could not achieve the envied rank of All-Star. Again, we might attribute this to overlearning.)

Allstar #1
Version: 5 Fitness: 20.0
Wins against random player (out of 100): 100
Score against Edgar: 59 to 4

Allstar #2
Version: 5
Fitness: 40.0
Wins against random player (out of 100): 95
Score against Edgar: 55 to 8

Allstar #3
Version: 4
Fitness: 25.0
Wins against random player (out of 100): 93
Score against Edgar: 59 to 5

Issues

Mysterious "black_edges" players: Several of our runs were very strange, for example, they would produce one good gif followed by endless gifs with the one-node tree "black_edges". We decided to discard those, as it was clearly an anomaly.
Players with zero fitness which lost to EDGAR: Along the same lines as the above issue, we got several players which achieved a "0.0" fitness rating (the best possible fitness rating), yet did not beat EDGAR. We discarded that data too, assuming they were "bad runs" that somehow got screwed up. (The black_edges cases mentioned above are included in this group.) It should not be possible for a player to lose to EDGAR and still get 0.0 fitness.
"Randomness" of the Random Player varied with the version of Java used. We noticed suspicious discrepancies between "random" results achieved using different jdk's. We solved this problem by only using one version, but it is still unnerving.
Long Runtime and Inexplicable Crashing were major issues that impeded our progress. We had to start and restart runs tens of times in order to get enough .gifs to use.