Useful literature:

AlphaGo

AlphaZero

Deep Pepper: an AlphaZero clone

DeepChess: comparison-based engine with α-β search

Learning to evaluate chess positions with deep neural networks and limited lookahead

Best-first minimax: principal variation

Chessprogramming

Python chess

Chess is a two-player game in which win, draw and loss results are possible. The game state is defined on an 8 by 8 board with 6 types of pieces in the following starting quantities per player: 8 pawns, 2 knights, 2 bishops, 2 rooks, 1 queen and 1 king. Transitions between positions are clearly defined. Players move alternately, starting with the player with the white pieces. Every piece has a unique way of moving and attacking (commonly called capturing); some moves, such as en passant or castling, are conditional on the board state. A piece captures an opposing piece by occupying its square. The king cannot be captured: when the king is attacked, the attack must be parried (for example by moving the king to a safe square), and a player who cannot do so loses the game. There is an additional promotion rule under which a pawn that reaches the other side of the board becomes a different piece. There are a few ways of drawing a game, including stalemate, threefold repetition of the position and the fifty-move rule. A game can also be lost by resignation or drawn by agreement. In competitive environments chess is played with time controls, where additional rules apply.

Solving chess means being able to find a best move in every position. A best move is a board transition such that, if consecutively best moves are played until the game terminates, the result is no worse than that achieved by any other sequence of moves, keeping in mind that we only influence the moves of the player to move in the starting position. The game of chess is trivially solved given infinite computational resources, for example with a precomputed lookup table. However, under realistic time and memory constraints the problem becomes more interesting. A few approaches have been proposed, and they can be distilled into two main ideas. The first is estimating the state value, i.e. the result of the game from a position given that best moves are played. This can be done by some function v(state, ...) which takes states as input and either ranks them or outputs a numeric value used for ranking. Such a numeric value can be interpreted in various ways, e.g. as material balance or probability of winning. What is important is that if the function **v** produces a perfect ranking, we can easily solve the game. The second approach notes that by applying all possible transitions to a position recursively to the end of the game, it is possible to check all lines of play and choose the most favourable one. This is achieved with a search algorithm, usually tree search. It is worth noting that this also leads to perfect play if we can expand all nodes and never have to apply a heuristic evaluation to non-terminal nodes. The current best-performing agents combine these two approaches. Details of each program can vary significantly; it is important to note that top programs are the result of long tuning or, more recently, large amounts of compute.
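The exhaustive-search idea can be illustrated with a minimal negamax on a toy game (pure Python; the take-1-or-2 Nim variant stands in for chess, which is obviously infeasible to expand fully):

```python
def negamax(state, moves, terminal_value):
    """Value of `state` for the side to move under perfect play.
    Every node is expanded; no heuristic evaluation is ever needed."""
    kids = moves(state)
    if not kids:
        return terminal_value(state)
    # A child's value is from the opponent's point of view, hence the sign flip.
    return max(-negamax(k, moves, terminal_value) for k in kids)

# Toy game: n stones, take 1 or 2 per turn; taking the last stone wins,
# so the player facing 0 stones has already lost.
moves = lambda n: [n - t for t in (1, 2) if n - t >= 0]
value = lambda n: -1  # the side to move at a terminal node has lost
```

With these rules negamax recovers the known result that positions with n divisible by 3 are lost for the side to move.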

The goal of this project is to check how strong an agent can be produced with limited precomputation and general algorithms. Superhuman play has already been achieved in the domain of chess, and we do not try to simply replicate it, but rather use chess as a testing environment for developing game AI. The basic idea is to combine neural networks with tree search techniques to produce a game-playing agent.

The common way of representing a chess state is a bitboard, where the board is interpreted as a 3D object. Each 2D plane of that object represents the whole board with a single type of information (for example one piece type). We encode the board using 16 planes: 12 for pieces, 2 for castling rights, 1 for en passant and 1 for the side to move. This gives a vector with 1024 inputs in total. Such a representation encodes the whole board state except the history, which is necessary for determining threefold repetition and the fifty-move rule. We ignore history on purpose to keep the representation simple; if those rules are needed they can be added in the higher-level logic of the program. Being able to represent the state does not mean, however, that it is a good input for the value function. The value function usually needs more abstract information about the position, such as material balance, center control, tempo and so on. The bitboard representation is also sparse, meaning that many bits represent empty squares. Finally, our representation is not even concise, leaving some bits unused. Therefore it would be good to map the bitboard representation to an embedding that is a better input for the value function.
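A minimal sketch of this encoding (pure Python, without a chess library; the piece placement is passed as a dictionary, and the exact plane order is an assumption for illustration):

```python
# Assumed plane layout: 12 piece planes (white P,N,B,R,Q,K, then black),
# 2 castling planes (one per side), 1 en passant plane,
# 1 side-to-move plane -> 16 * 64 = 1024 bits.
PIECES = "PNBRQK"

def encode(board, white_castle, black_castle, ep_square, white_to_move):
    """board: dict square(0..63) -> piece letter, uppercase = white."""
    planes = [[0] * 64 for _ in range(16)]
    for sq, piece in board.items():
        color_offset = 0 if piece.isupper() else 6
        planes[color_offset + PIECES.index(piece.upper())][sq] = 1
    if white_castle:
        planes[12] = [1] * 64
    if black_castle:
        planes[13] = [1] * 64
    if ep_square is not None:
        planes[14][ep_square] = 1
    if white_to_move:
        planes[15] = [1] * 64
    return [bit for plane in planes for bit in plane]  # flat 1024-vector

# A toy position: white king e1, white pawn e4, black king e8, white to move.
x = encode({4: "K", 28: "P", 60: "k"}, True, True, None, True)
```

In practice the placement dictionary would be filled from a library such as python-chess rather than by hand.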

If we are to believe the Wikipedia page on the Shannon number, the upper bound on the number of bits necessary to encode a chess position is lower than 155. Using a binary vector of size 1024 for the same purpose is a bit wasteful. This in itself is not, we think, a big concern, because we could process such an input nonetheless. The problem with processing the bitboard representation through a function with learnable parameters, such as a neural network, is that the unnecessary weights in the model add noise. Moreover, chess is a bit like image recognition. If we want to recognize a face in a picture, we may not care in which region of the photograph it appears. The same applies to chess pieces: many evaluations of a piece's value may not depend on its position on the board. Nonetheless, with the bitboard representation the same computation would need to be performed and learned for every square. By embedding our state in a lower-dimensional space, the unnecessary spatial relations can be ignored, resulting in a hopefully simpler problem. We try to embed the vector of length 1024 into one of length 256. To do so, a convolutional autoencoder trained layer by layer is applied.
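The training idea can be sketched with a toy linear autoencoder trained by full-batch gradient descent (NumPy; the dimensions are toy stand-ins, and the real model is convolutional, maps 1024 to 256 and is trained layer by layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions standing in for the real 1024 -> 256 mapping.
n_in, n_code, n_samples = 64, 16, 128
X = (rng.random((n_samples, n_in)) < 0.2).astype(float)  # sparse bit vectors

W_enc = rng.normal(0.0, 0.1, (n_in, n_code))   # encoder weights
W_dec = rng.normal(0.0, 0.1, (n_code, n_in))   # decoder weights

def loss(We, Wd):
    return (((X @ We @ Wd) - X) ** 2).mean()   # reconstruction MSE

loss_before = loss(W_enc, W_dec)
lr = 1.0
for _ in range(500):
    code = X @ W_enc
    err = 2.0 * (code @ W_dec - X) / X.size    # dMSE / d(reconstruction)
    g_dec = code.T @ err                       # gradient w.r.t. W_dec
    g_enc = X.T @ (err @ W_dec.T)              # gradient w.r.t. W_enc
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
loss_after = loss(W_enc, W_dec)
```

The reconstruction loss falls as training proceeds; `X @ W_enc` is the embedding that the value function would consume.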

Classically, estimating the value of a chess position was done by linearly combining handcrafted parameters. The relative weights of such parameters may have been learned in some fashion, as in Deep Blue, but the parameters themselves were not discovered by search; rather, they were the effect of theoretical study of the game. We will try to estimate the value from the embedding. The board state compressed by the encoder will serve as features. These 256 parameters act as something in between inputting the raw board position and highly selected parameters. Our value function should select, combine and finally evaluate those features. Some papers (Sabatelli et al.) and our experience suggest that, in contrast to the DeepMind architecture, a plain MLP is better than a CNN at this task. Note that we in fact combine the two approaches by using the embedding rather than the raw position as input. In addition to the evaluation, a policy may be computed. The result would be a so-called hydra network with two heads: one returning the estimated value and the other a probability distribution (or an equivalent) over possible moves. The policy will be useful in the search algorithm (MCTS as in the AlphaGo paper, but it may also help with move ordering in algorithms such as alpha-beta).
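A forward pass of such a two-headed network can be sketched as follows (NumPy, random untrained weights; the hidden width and the 64-move policy size are our assumptions, and the 256-dimensional input stands for the embedding):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Shared trunk: 256-dim embedding -> 128 hidden units.
W1 = rng.normal(0, 0.1, (256, 128))
# Value head: hidden -> scalar in (-1, 1).
Wv = rng.normal(0, 0.1, (128, 1))
# Policy head: hidden -> distribution over (here) 64 candidate moves.
Wp = rng.normal(0, 0.1, (128, 64))

def hydra(embedding):
    h = np.maximum(0.0, embedding @ W1)        # ReLU trunk
    value = np.tanh(h @ Wv)[0]                 # win/draw/loss estimate
    policy = softmax(h @ Wp)                   # move "probabilities"
    return value, policy

value, policy = hydra(rng.normal(0, 1, 256))
```

A real implementation would add biases, more layers and a legal-move mask on the policy head, but the two-headed structure is the point here.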

To find which algorithm is applicable to our search space, we need to know something about the problem. In chess, all transitions are defined as moves of the pieces, and the next state can be derived solely from the current state and the transition. There are no probabilistic transitions, and all information is public: there are no hidden variables. The evaluation should also get better as we search deeper. This is obviously the effect of states getting closer to a terminal state, but also positions in chess tend to contain less information as pieces get traded. Selectivity in chess search is quite high, as the best moves are examined disproportionately more than the weaker ones. The branching factor in chess is on average about 35, and a typical game lasts about 40 moves per player, i.e. about 80 plies; this gives on the order of 35^80 ≈ 10^123 possible lines of play (the Shannon number). Classically, alpha-beta was the go-to algorithm for chess, but more recently MCTS has also been applied; both have been used with large modifications. Alpha-beta is highly tuned for chess, but it is used with a very fast evaluation function. If we want to use a slower and possibly more accurate one, we are better off with a more selective search. Thus we will use the MCTS algorithm as a platform for building a case-specific search engine. This approach is similar to AlphaZero's. The basic paradigm of MCTS is to select and expand the most promising node, then evaluate it by random simulation, and then backpropagate the result up the tree. This idea was developed for Go-playing programs and is not necessarily a good fit for our problem. Let us analyze chess search in this framework. The MCTS used in AlphaZero dropped the core idea of simulations in favour of evaluation by a neural network. Node choice is done via the so-called UCT formula, which combines the value with an uncertainty term. Backpropagation is as simple as it gets: the result backpropagated to each node is averaged over its subtree.
The best node is finally chosen by the weak criterion of total visit count. In contrast, alpha-beta does not choose, except for cutoffs, but expands all nodes in order; only the leaf nodes are evaluated. Backpropagation is done at the end in bottom-up fashion, and the final choice is made by the strong criterion of picking the best leaf node in the subtree. Note that MCTS has only one major advantage, selectivity, and is poorly suited to the problem in all other areas: random playouts, node choice, and even UCT. The subtree averaging is harder to assess; it does not fit the problem at all, as the quality of a position cannot be said to equal the average of the positions that can result from it. On the other hand, it helps minimize the propagated network error, as noted in the AlphaZero paper.
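The UCT selection rule mentioned above can be sketched as follows (pure Python; the exploration constant `c` and the toy visit statistics are illustrative):

```python
import math

def uct_select(parent_visits, children, c=1.4):
    """children: list of (mean_value, visit_count) pairs. Returns the index
    of the child maximizing Q + c * sqrt(ln(N_parent) / N_child)."""
    def score(child):
        q, n = child
        if n == 0:
            return math.inf          # always try unvisited children first
        return q + c * math.sqrt(math.log(parent_visits) / n)
    return max(range(len(children)), key=lambda i: score(children[i]))

# A well-explored strong move vs. a barely explored weaker one:
best = uct_select(100, [(0.6, 80), (0.2, 3)])
```

Note how the uncertainty term can favour a move with a worse mean value purely because it has been visited rarely; this is the exploration/exploitation trade-off the formula encodes.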

As noted earlier, classical alpha-beta search is specialized to one type of problem and is highly inflexible: it is in fact just depth-first search with case-specific heuristics added. We do not dismiss it outright, but think that one should at least be conscious of other options when combining neural networks with it. On the other hand, we can use the MCTS framework to tailor a fitting search. To avoid the problems of AlphaZero-style MCTS discussed earlier, we customize it for our needs. We conjecture that in games such as chess there exists a mapping, which we call **z(v)**, that maps the estimated value to the "probability" of a move being picked. (To be clear, to get actual probabilities in the implementation we normalize these values so that z(v_i) over the child nodes *i* sums to 1.) We then explore each node with likelihood equal to its estimated probability. Matching the time spent exploring a situation to the probability of that situation occurring seems like common sense. To pick a node we keep track of an *action value* associated with each node. For an internal node this value equals the sum, over all child nodes, of their probabilities multiplied by their respective action values. The definition is recursive; for leaf nodes the action value is defined to be the value obtained from the estimator. To update the probabilities and action values we perform some simple calculations which I won't discuss for now. Finally, the node that maximizes the action value is chosen as the result of the search. The algorithm has a few parameters. Most importantly, there is *cParam*, which depends on the choosiness of the players observed in a given game (we are obviously interested specifically in chess).
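These two definitions can be sketched directly (pure Python; the softmax-style form of z(v) with temperature cParam is our assumption, since the exact mapping is left open above, and side-to-move sign handling is omitted for brevity):

```python
import math

C_PARAM = 0.5  # "choosiness" temperature; a tuning assumption

def z(values, c=C_PARAM):
    """Map children's estimated values to normalized pick probabilities."""
    weights = [math.exp(v / c) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def action_value(node):
    """node: (value, children). A leaf's action value is the estimator's
    value; an internal node's is the probability-weighted sum over its
    children, matching the recursive definition in the text."""
    value, children = node
    if not children:
        return value
    q = [action_value(child) for child in children]
    p = z(q)
    return sum(pi * qi for pi, qi in zip(p, q))

# Two leaf moves from the root: a strong one (0.8) and a weak one (-0.5).
root = (None, [(0.8, []), (-0.5, [])])
```

With a small cParam almost all the weight goes to the strong move, so the root's action value stays close to 0.8 rather than the plain average of the children.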

Note: I will probably write more about this topic after I test and compare this solution to the other approaches.

to do

- Layerwise training of autoencoder from 700 loss (0.68 pawn diff per repr. square) to 23 loss (~0.02).
- Layerwise retraining of autoencoder up to under 10 loss (~0.01).