Genetic Programming Applied to Board Game Othello

(W4771 Machine Learning Homework 4)

Li Liao

Introduction

In this project, genetic programming is applied to learn to play a board game called Othello. Othello is played on a square 8x8 gridded board. Two players (one with white pieces, the other with black) take turns placing their pieces on the board, one piece at a time. The rule is that a piece is placed in order to "eat" the opponent's pieces by surrounding them: all of the opponent's pieces that lie between two of yours (in a straight line) become yours. A piece may be placed only when the move will "eat"; otherwise you lose your turn. The game starts with two pieces of each player already at the center of the board, and the player who has more pieces at the end of the game wins.

The learning problem is as follows:

The target knowledge of this learning problem is a function capable of ranking all legal moves and choosing the best one. Genetic programming has been chosen as the learning mechanism for this project, and we are provided with a Java implementation that represents the target function with 13 primitives: + - * / 10 white white_corners white_near_corners white_edges black black_corners black_near_corners black_edges. The first four primitives are the standard arithmetic operators; the rest are terminals, all but one (the constant 10) representing board statistics such as piece counts at particular positions. The Java implementation (including standard genetic programming constructs such as crossover, mutation, and ADFs) produces and evolves candidate target functions based on their fitness. A target function satisfying a specified fitness requirement is the solution: a competent Othello player. In this project, we modify and extend the current genetic programming setup for improvement.

Motivation

I did not know the game Othello before this project. When I first played against Edgar I did not even know what the game was about. It took me about 5 minutes to figure out how to play, about one hour to tie with Edgar, and after an hour and a half I was already confident I could beat Edgar. I personally experienced a whole learning process, which gave me some insight into the game.

It is intuitive that a good player will use different strategies at different stages of a game to secure a win. The original implementation confines the solution to one target function that is used to play Othello from start to end. In this project, we tested variations on hypotheses that may behave differently in the end-game versus other points in the game: specifically, a hypothesis is composed of separate trees, each of which specializes in a different temporal epoch of the game.

The purpose of this experiment is to see whether and how the target functions may differ at different epochs of the game, and, more interestingly, whether a hypothesis with a staged target function will outperform hypotheses with a uniform target function.

Overview

In this project, we experimented with a "staged" hypothesis technique, hoping to help the hypothesis learn each part of the game more directly. Variations on training parameters such as population size and termination fitness were tried along with different stage points in order to reveal how these factors influence the staged hypothesis.

Approach

In order to experiment with staged hypotheses, we need to modify the original implementation so that a separate tree can be trained for a specified epoch of the game, and so that the staged hypothesis consisting of multiple trees can be used to play the game.

In GPOthello.java, where all of the classes for the GP training implementation of Othello are defined, three functions are mainly responsible for characterizing different learning tasks. The first two are GPOthello.createNodeSet() and GPOthello.evaluate(), which define the primitives (syntax) and how the primitives are used to express a target function (semantics), respectively. The third, GPOthelloGP.evaluate(), actually defines how fitness is calculated and is where the training data are incorporated. Since we are mainly interested in exploiting the expressiveness of the hypothesis space in the temporal dimension, we keep GPOthello.createNodeSet() and GPOthello.evaluate() intact, i.e., we use the same syntax and semantics as in the original implementation; readers interested in new primitives should refer to the corresponding write-ups.

The modification was made in GPOthelloGP.evaluate(). The original implementation has two loops: the first loop is over 5 separate games (playing against RandomPlayer), and the second loop is over all moves in a single game. The opponent's score on the board at the end of each game is taken as the raw fitness, and its sum over the 5 games is used as the standard fitness, which is what genetic evolution tries to bring down. The modification adds a condition in the inner loop so that only the opponent's score gained during a certain epoch (rather than the whole game) counts toward fitness. The target function thus obtained is understood to be specialized: strong at playing the epoch of the game it is trained on. The modified GPOthello was run, for instance, twice to obtain two separate target functions F1 and F2, one good at playing the first half of the game and the other good at the second half. One may argue that a more accurate way is to have the two trees each play their own epoch during training, as we did later when using a combined target function to play the tournament. In this sense, our approach is an approximation; we may call it a mean-field approximation, and it works well, as supported by the results in the next section. A hedged sketch of the epoch-restricted fitness loop follows.
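The sketch below is a minimal illustration only; the OthelloSimulator interface and its methods are hypothetical stand-ins for the actual gpjpp/Othello classes, and only the epoch condition reflects the real modification.

    // Hypothetical stand-in for the real game-playing classes.
    interface OthelloSimulator {
        void reset();            // start a new game against RandomPlayer
        int maxMoves();          // total number of moves in a full game
        boolean gameOver();
        int playOneMove();       // play one move; return opponent pieces gained
    }

    class EpochFitnessSketch {
        // epochStart/epochEnd are fractions of the game, e.g. [0.0, 0.5)
        // trains a tree specialized for the first half.
        static double evaluate(OthelloSimulator sim, double epochStart, double epochEnd) {
            double fitness = 0.0;
            for (int game = 0; game < 5; game++) {        // outer loop: 5 games
                sim.reset();
                int total = sim.maxMoves();
                for (int move = 0; move < total && !sim.gameOver(); move++) {
                    int gain = sim.playOneMove();
                    double progress = (double) move / total;
                    // The added condition: count only the opponent's gains
                    // that fall inside this tree's epoch.
                    if (progress >= epochStart && progress < epochEnd) {
                        fitness += gain;
                    }
                }
            }
            return fitness;  // evolution tries to bring this down
        }
    }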

A staged hypothesis H is formed by combining these two functions. The next step is to modify OthelloCache.java, where the trained hypothesis is used to play against a random player. The constructor of OthelloCache was modified so that it can accept a staged hypothesis whose genetic representation has two or more target functions (in prefix format), each followed by its epoch and separated by commas. For example,

java OthelloCache scoreonly "( white ),50,( + white white_corners ),50"

gives a staged hypothesis which has two target functions: function ( white ) and function ( + white white_corners ) play the first 50% and the second 50% of the game, respectively. In the OthelloCache constructor, a StringTokenizer extracts the different target functions, which are used to construct corresponding OthelloGPPlayers. An array of OthelloGPPlayers and their epochs are passed to the method playGame(OthelloGPPlayer[], int[]). Another way to do this would be to subclass OthelloGPPlayer so that a single player can have a multi-stage target function; this is just a design issue, and we decided to keep OthelloGPPlayer as-is and construct an individual player for each stage. Next, in playGame, a timer (move counter) is added inside the while(!board.gameOver()) loop. Over the course of the game, the different target functions are called in and designated as whitePlayer according to their epochs, i.e., to decide a move:

currentMove = whitePlayer.bestMove(board)
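The two pieces just described can be sketched as follows, under assumed class and method names (the real OthelloCache and playGame differ in detail); the comma-delimited parsing and the stage selection by move percentage are the points of interest.

    import java.util.List;
    import java.util.StringTokenizer;

    class StagedPlaySketch {
        // Parse "tree,epoch,tree,epoch,..." as accepted by the modified constructor.
        static void parseStaged(String spec, List<String> trees, List<Integer> epochs) {
            StringTokenizer st = new StringTokenizer(spec, ",");
            while (st.hasMoreTokens()) {
                trees.add(st.nextToken().trim());                     // prefix-format tree
                epochs.add(Integer.parseInt(st.nextToken().trim()));  // its share of the game, in %
            }
        }

        // Select which stage's player decides the current move; epochs holds
        // percentages summing to 100, e.g. {50, 50} or {75, 25}.
        static int stageFor(int moveCount, int totalMoves, int[] epochs) {
            int percentDone = 100 * moveCount / Math.max(totalMoves, 1);
            int cumulative = 0;
            for (int i = 0; i < epochs.length; i++) {
                cumulative += epochs[i];
                if (percentDone < cumulative) return i;
            }
            return epochs.length - 1;  // the last stage plays out the remainder
        }
    }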

Results

As a first test of our staged hypothesis approach, we used two simple trees and different temporal combinations of them to exemplify the effect. Notably, even without much trial and error, the results are already convincing.

Table 1. A two-stage hypothesis

Tree(s)                            | Wins (out of 50 games against RandomPlayer)
( white )                          | 44
( white_corners )                  | 1
( white ),50,( white_corners ),50  | 43
( white ),75,( white_corners ),25  | 49

The genetic player plays White. An obvious, naive target function is simply the total number of one's own pieces on the board; the move that increases this number the most is considered the best move. Though it sounds naive and simplistic, this hypothesis turns out to be very effective: it wins 44 out of 50 games against a random player. From our own playing experience, we know that the corner positions are critical for a final win. Therefore, we simply use ( white_corners ) as another hypothesis. The result, only 1 win out of 50, may be surprising at first glance. It does, however, make sense on reflection: while corner positions are very important, one only gets the chance to occupy them late in the game. This suggests combining the two hypotheses so that each is used only where it is strong. The experiments with this staged hypothesis support the idea: the last row of the table above shows that the strategy of using ( white ) to play the first three quarters of the game and ( white_corners ) the last quarter outperforms the others.
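For concreteness, here is a hedged sketch of how a terminal-only tree like ( white ) acts as an evaluator: among all legal successor positions, pick the one that maximizes the white-piece count. Successor generation and the board encoding (1 for a white piece) are assumptions, not the project's actual code.

    import java.util.List;

    class GreedyWhiteSketch {
        // Pick the successor position that maximizes the target function
        // ( white ), i.e. the count of one's own pieces.
        static int[] pickBest(List<int[]> successors) {
            int[] best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int[] pos : successors) {
                double score = countWhite(pos);
                if (score > bestScore) { bestScore = score; best = pos; }
            }
            return best;
        }

        static double countWhite(int[] pos) {
            int n = 0;
            for (int cell : pos) if (cell == 1) n++;  // assume 1 encodes a white piece
            return n;
        }
    }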

While the above results are convincing, the hypotheses used were picked by hand (note: each of these two trees is indeed reachable by GP). It is desirable to experiment with more cases, especially variations on training parameters. The results of such experiments are reported in the following table.
 

Table 2. Variations with different termination fitness (all wins out of 50 games against RandomPlayer)

Termination Fitness | Tree1 | Tree1 Wins | Tree2 | Tree2 Wins | 50/50 Staged Wins
50.0 | ( + white ( + white_corners black_near_corners ) white ) | 46 | ( / ( - white_near_corners black_corners ) ( - black_corners white_near_corners ) ) | 49 | 45
25.0 | ( white ) | 44 | ( + ( + white_corners white ) ( * ( * white_edges black_edges ) ( - white_corners black_near_corners ) ) ) | 46 | 50

As seen in the above results, an abrupt transition from one tree to another does not always improve performance. It is reasonable to believe that a smoother transition may give better results. A smoother transition can be implemented by allowing the epochs of the two trees to overlap; during the overlapping period, which tree is used is decided probabilistically by some distribution P that gradually shifts favor from tree1 to tree2, as sketched below. For lack of time, this idea was not pursued.
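Though not implemented in this project, the idea can be sketched as follows: inside an overlap window, tree2 is chosen with a probability that rises linearly across the window, so control shifts gradually. The linear schedule is just one possible choice of the distribution P.

    import java.util.Random;

    class SmoothTransitionSketch {
        static final Random RNG = new Random();

        // Return the stage to use (0 = tree1, 1 = tree2) at the given move.
        static int chooseStage(int move, int overlapStart, int overlapEnd) {
            if (move < overlapStart) return 0;
            if (move >= overlapEnd) return 1;
            // P(tree2) grows linearly from 0 to 1 across the overlap window.
            double p = (double) (move - overlapStart) / (overlapEnd - overlapStart);
            return (RNG.nextDouble() < p) ? 1 : 0;
        }
    }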

Bug Report

Some effort was spent in understanding the original implementation. As a result of this (sometimes painful) experience, I report here a bug in the original implementation, which I believe might have had some impact on previous experiments. The bug is in the method OthelloGPPlayer.evalSubFunction(). This method, at the heart of an OthelloGPPlayer, interprets the target function and uses it to evaluate and rank all current legal moves given the board statistics (i.e., how many whites, blacks, white_corners, and the like are on the board). Given the board statistics, the method takes the target function as a StringTokenizer and calculates it recursively (when a token is an operator, i.e., + - * /). The original implementation calculates a target function incorrectly when it contains a closing parenthesis followed immediately by an opening parenthesis, like ") (", which appears in an expression such as "( + ( * white white ) ( * black black ) )". The reason is that when a closing parenthesis is encountered, the original implementation simply moves on; the downstream conditions (unnecessary, as we will see shortly) do not handle an opening parenthesis, and the method never calls itself recursively (which would handle the opening parenthesis) except when a token is an operator. As a result, the contribution from such subtrees is silently assigned zero, though you will be alerted with a "Parse Error". To correct the bug, simply delete the conditions:
      if (item.equals(")")) 
        item = stok.nextToken();
      if (item.equals("))"))
        item = stok.nextToken();
      if (item.equals(")))"))
        item = stok.nextToken();
      if (item.equals("))))"))
        item = stok.nextToken();
and add a new condition
      if (item.equals(")")) {
        return evalSubFunction(stok,white,black,
                               white_edges,black_edges,
                               white_corners, black_corners,
                               white_near_corners, black_near_corners);
      }
after the condition
      if (item.equals("10")) {
        return 10.0;
      }
With the above modification, all target functions in prefix format are calculated correctly. A caveat is that all primitives in an expression must be separated by a space.
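To illustrate why the ")" case must recurse, here is a self-contained sketch of the evaluation scheme, reduced to two terminals for brevity (the real evalSubFunction takes all eight board statistics). With white = 3 and black = 2, it evaluates the expression from the bug report to 3*3 + 2*2 = 13.

    import java.util.StringTokenizer;

    class PrefixEvalSketch {
        static double eval(StringTokenizer stok, double white, double black) {
            String item = stok.nextToken();
            if (item.equals("(")) return eval(stok, white, black);  // descend into subexpression
            if (item.equals(")")) return eval(stok, white, black);  // the added condition: keep parsing
            if (item.equals("+")) return eval(stok, white, black) + eval(stok, white, black);
            if (item.equals("-")) return eval(stok, white, black) - eval(stok, white, black);
            if (item.equals("*")) return eval(stok, white, black) * eval(stok, white, black);
            if (item.equals("/")) return eval(stok, white, black) / eval(stok, white, black);
            if (item.equals("10")) return 10.0;
            if (item.equals("white")) return white;
            if (item.equals("black")) return black;
            throw new IllegalArgumentException("Parse Error at: " + item);
        }

        public static void main(String[] args) {
            String expr = "( + ( * white white ) ( * black black ) )";
            System.out.println(eval(new StringTokenizer(expr), 3.0, 2.0));  // prints 13.0
        }
    }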

Conclusions

From the experiments conducted in this project, it is convincing that in most cases the "staged" hypothesis technique is an effective approach to producing better hypotheses in the context of genetic programming. However, a few notes are worth making about the current implementation and the limitations of this technique. First, as seen above, performance is not always guaranteed to improve even with a two-stage hypothesis. Second, while a staged hypothesis seems capable of incorporating temporal behaviour, this technique is unlikely to pay off when it is abused, i.e., staged into too many epochs. To understand why, consider the extreme case where every step of the game is handled by a separate target function specialized for that step. This strategy will not work because of the dynamic nature of playing a board game: a move that looks bad at one point may lead to a good payoff later on; simply put, there is no tree that is best for one specific step of a game. To learn, one has to interpolate and/or extrapolate, that is, to have bias. This also justifies our mean-field approach to building a staged hypothesis. As mentioned at the end of the Results section, it would be interesting, as a further study, to see a smoother transition from one epoch to the next.