CS4771: Machine Learning 04-04-2000


Blaine Boman [ bgb10 ]
Jenny Weisenberg [ jw282 ]


For this project, we were given a Java implementation of Othello, including a package which allows for players to be evolved using genetic programming techniques. We also had available to us a computer player named EDGAR, and a Random Player which selects random moves. Each evolved player is represented by a tree consisting of primitives: binary arithmetic operators (+, -, *, /) and terminals. The terminals describe positions on the Othello board taken by a certain color, such as "black_corners" and "white_edges".

In our project, we: Motivation

It is clear that the choice of terminals is critical in being able to evolve a decent player. We were interested in doing some experiments to determine whether giving the GA a larger terminal set would necessarily make it perform better. Would increasing the terminal set size always make the GA run the same or better, knowing that the GA could either choose to take advantage of its new descriptive powers or know enough to ignore them if they did not help?
On the road to exploring the question above, training against EDGAR, we noticed that some of the evolved players with excellent fitness ratings against EDGAR did not do well against a Random Player, while some of the evolved players with more mediocre fitness ratings beat the Random Player every time. So, we expanded our project to include an exploration of the relationship between overlearning and fitness levels of evolved players. (low fitness could mean an overlearned player.)


We created five versions, V1-V5. Each version indicates a different set of terminals used. V(X+1) includes all of the terminals in VX, plus two more terminals, as indicated. The terminals included in V1-V4 are identical to those in the original GPOthello code. We added the terminals "white_near_near_corners" and "black_near_near_corners", defined to be the indicated positions in the image below:

Here is a table of the terminal sets used for each version:

V1 black, white Reduced original code
V2 black, white
white_edges, black_edges
Reduced original code
V3 black, white
white_edges, black_edges
white_corners, black_corners
Reduced original code
V4 black, white
white_edges, black_edges
white_corners, black_corners
white_near_corners, black_near_corners
*Original code*
V5 black, white
white_edges, black_edges
white_corners, black_corners
white_near_corners, black_near_corners
white_near_near_corners, black_near_near_corners
Expanded original code

For each version, we did two runs, one of which had a high Termination Fitness (75.0) and one of which had a low Termination Fitness (21.0). (The 21 was included because the fitness ratings tend to be in multiples of 10, so we wanted to include the 20.0s) The former was designed to produce players with relatively bad fitness ratings against EDGAR (while still beating him). The latter was designed to produce players with low (good) fitness ratings against EDGAR. We wonder whether or not a player with a low (good) fitness rating would perform better against random players as well or if it do worse because it is overfit to Edgar's particular playing style.
We kept most of the initialization values constant, most importantly, we kept a population of 200, and 10 good runs. We decided to turn the ComplexityAffectsFitness setting to FALSE, thereby avoiding extra variables which could cause confusion. This is especially important because having ComplexityAffectsFitness set to true might help avert overlearning, throwing out complicated trees. We wanted to explore the NATURAL results, so we allow for no automatic pruning of this overgrown shrubbery.


We produced roughly 20-25 Othello players for each version of our experiment. (20-25 players for each terminal set.) We then evaluated their performance against EDGAR and against random players.

Here is a table of our result averages, which illustrates the effect of increasingly large terminal sets. Click here to see more details.

(lower rating is "better" fitness)
PLAYER (out of 100 games)

As might be expected, the fitness rating (against EDGAR) decreases ("fitness gets better") as the terminal set gets larger. A larger terminal set is like a bigger vocabulary, enabling the GA to be more descriptive and develop a better player. Likewise, as the size of the terminal set increases, the evolved players, on the average, beat EDGAR by more points. One would have hoped that performance against the Random Player might improve with a larger terminal set, too, even though training was done against EDGAR. Our results did not support this, however, as you can see from the table, there was no such trend. We might attribute this to the fact that a larger terminal set could promote overlearning, producing a very specialized player which can beat EDGAR but which can't generalize to beat a Random Player.
The two largest decreases (improvments) in average fitness occurred between Versions 1 and 2 and Versions 3 and 4. Since version one had only the "white" and "black" terminals, it was not able to encode very much information about the current board state. So it was not suprising that V1 could not perform very well. With the addition of the "white_edges" and "black_edges" terminals in V2, some information regarding the general placement of pieces was able to be taken into account, thus drastically increasing the fitnesses of produced trees. The terminals added between Versions 3 and 4 were "white_near_corners" and "black_near_corners". One might expect the "corners" terminal (added in Version 3) to result in the highest fitness increase, as capturing the corners is surely the key to winning. After playing several games of Othello, however, one realizes that forcing your opponent to move in one of the "near_corners" positions is as important if not more important than capturing the corners themselves.

Here are some graphs of results.

Fitness vs. Random Wins

Disappointingly, the results above are pretty scattered, with no particularly interesting patterns. A naive thought might be to expect the dots to form sort of a diagonal scattering from the upper left hand corner to the lower right hand corner of each graph, indicating that lower (better) fitness performs better. But again, these players were trained against EDGAR, so a very feasible explanation for why that did not happen is because the players with very low fitness were overfitted to EDGAR's strategy, and thus are not generalizable.
The graph of Version 4 is interesting in that it seems like the players either did extremely well or extrememly poorly against the Random Player.

Fitness vs. Beats Edgar By

As expected, all of the graphs above show some sort of linear relationship between the fitness rating and the performance against EDGAR, with lower (better) fitnesses performing well against EDGAR. This is not surprising, as the players were trained against EDGAR. (Beats Edgar By is simply the number of points by which our evolved player beats EDGAR in a showdown. Negative values obviously mean that our player lost to EDGAR.)


We are heartened by the fact that the GA is in fact able to take advantage of increasingly larger terminal sets, and that giving it the power to be increasingly descriptive does produce increasingly skilled players. We have also learned that increasing the descriptive powers of the trees does not necessarily produce players which perform well against random players, as they might tend to be overfitted. This problem might be solved, however, by training our players against a fairly good non-deterministic player -- this might be more likely to result in more players which consistently beat both EDGAR and the Random Player.

Our study has made us curious about the questions of How helpful is each terminal? and Which are the most important terminals? We delved into that a bit while looking at the jump sizes between fitness ratings and performances between the two versions. It would also be interesting to study terminal sets of the same size to better answer this question. For example, it would be a good experiment to test out two terminal sets, both including white/black and white/black_edges, one including white/black_corners, the other including white/black_near_corners. This would be a more direct way to test the effectiveness/value of a specific terminal.

Meet Some Especially Notable Players

As suspected, all three of our All-Star players came from versions 4 and 5 and had "mediocre" fitness values. (Our data did include many players with fitness ratings of 0.0, but these did not perform as well against the Random Player, and thus could not achieve the envied rank of All-Star. Again, we might attribute this to overlearning.)

Allstar #1
Version: 5 Fitness: 20.0
Wins against random player (out of 100): 100
Score against Edgar: 59 to 4

Allstar #2
Version: 5
Fitness: 40.0
Wins against random player (out of 100): 95
Score against Edgar: 55 to 8

Allstar #3
Version: 4
Fitness: 25.0
Wins against random player (out of 100): 93
Score against Edgar: 59 to 5