Methods and Metrics in Performance Comparison and Evaluation
Comparing the performance of two given models is essential in training a mahjong AI. Which model is the best, how far along is the training, has the training already converged, and what are the optimal hyperparameters? Performance comparisons among models are indispensable for making such judgments. Learning without performance comparisons is tantamount to groping in the dark.
Several research papers and web articles on developing mahjong AI use various metrics for performance comparison and evaluation. Examples include the value of the loss function and the rate at which the model's choices match those of expert mahjong players at particular points in game records, known as the agreement rate on game records (牌譜一致率). However, in the course of this project, it has become clear that these indirect measures are of little use in evaluating model performance.
Therefore, this project uses the most direct method to compare and evaluate the performance of mahjong AI models: two given models actually play against each other, and the statistics obtained from a large number of actual games serve as the benchmark for performance comparison.
In the performance comparison and evaluation in this project, the model to be evaluated is called the proposed model, and the model that serves as the reference against which the proposed model is evaluated is called the baseline model.
In comparing two given models, the proposed model and the baseline model, through actual 4-player mahjong games, three different styles can be considered. In this project, they are called the 1vs3 style, the 2vs2 style, and the 3vs1 style.
In the 1vs3 style, one of the four mahjong players is played by the proposed model and the other three are played by the baseline model. In the 2vs2 style, two players are played by the proposed model and the other two are played by the baseline model. In the 3vs1 style, the roles of the proposed model and the baseline model in the 1vs3 style are swapped.
The most difficult aspect of comparing the performance of mahjong AIs is that mahjong is an incomplete-information game in which a large portion of an extremely large state space is hidden, the result of each trial is highly random, and a large number of trials is required to obtain statistically reliable performance evaluation results.
To significantly alleviate the above difficulties, this project introduces a simple but very effective trick in comparing and evaluating performance in actual mahjong games.
The trick is called duplicate mahjong (複式麻雀).
The idea is simple. In duplicate mahjong, multiple games are played with the same tile wall, and therefore with the same initial hands. For example, in the 1vs3 style there are a total of 4 possible seating arrangements for the two models, the proposed model and the baseline model, so 4 games are played with the same tile wall but with different seating arrangements. These 4 games constitute one set of duplicate mahjong. Multiple sets, each consisting of 4 games with the same tile wall, are conducted. The 3vs1 style also has 4 games per set. The 2vs2 style has 6 possible seating arrangements for the two models, so 6 games with the same tile wall constitute one set.
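The following is a minimal sketch of how the games in one set of duplicate mahjong could be enumerated; the labels 'P' (proposed model) and 'B' (baseline model), as well as all names used below, are illustrative only and do not correspond to actual code in this repository.

```python
from itertools import permutations

# Per-game composition of each comparison style:
# one label per seat, 'P' = proposed model, 'B' = baseline model.
STYLE_COMPOSITION = {
    "1vs3": ("P", "B", "B", "B"),
    "2vs2": ("P", "P", "B", "B"),
    "3vs1": ("P", "P", "P", "B"),
}


def duplicate_set_seatings(style: str) -> list[tuple[str, ...]]:
    """Return every distinct seating arrangement for one duplicate-mahjong set.

    All games in the set share the same tile wall; only the seats assigned
    to the proposed and baseline models differ.
    """
    return sorted(set(permutations(STYLE_COMPOSITION[style])))


if __name__ == "__main__":
    for style in STYLE_COMPOSITION:
        seatings = duplicate_set_seatings(style)
        print(style, len(seatings), seatings)
```

Running this prints 4, 6, and 4 distinct arrangements for the 1vs3, 2vs2, and 3vs1 styles respectively, matching the set sizes described above.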
This trick is expected to drastically reduce the randomness inherent in mahjong and to yield much more reliable results.
For each metric, not only the mean value but also the unbiased sample variance is reported. In addition, the 95% and 99% confidence intervals for each metric, calculated from the distribution that it (approximately) obeys, are also provided.
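As a rough sketch of how such confidence intervals could be computed for the per-game metrics below that obey a Student's t-distribution, one might proceed as follows; the function name `t_confidence_interval` and the sample values are purely illustrative and not part of this repository.

```python
import math
from statistics import mean, variance  # variance() is the unbiased sample variance

from scipy.stats import t


def t_confidence_interval(samples: list[float], level: float = 0.95) -> tuple[float, float]:
    """Confidence interval for the mean of per-game values of a metric,
    based on the Student's t-distribution with n - 1 degrees of freedom."""
    n = len(samples)
    m = mean(samples)
    s2 = variance(samples)  # unbiased sample variance
    half_width = t.ppf((1.0 + level) / 2.0, df=n - 1) * math.sqrt(s2 / n)
    return m - half_width, m + half_width


# Illustrative per-game values of some metric, one per PoV of the proposed model.
samples = [1.0, -0.5, 2.0, 0.0, 1.5, -1.0, 0.5, 2.5]
print(t_confidence_interval(samples, 0.95))
print(t_confidence_interval(samples, 0.99))
```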
This metric is the ranking (placement) of the proposed model in each game, where 1st place counts as 1 and 4th place counts as 4.
- Values between completely identical models converge to 2.5, since every placement is then equally likely and (1 + 2 + 3 + 4) / 4 = 2.5.
- Lower is better.
- This obeys the Student's t-distribution with n - 1 degrees of freedom, where n is equal to (# of games) × (# of PoVs of the proposed model per game).
This metric is the expected increase or decrease in grading points of the proposed model per game, assuming that the grade of the proposed model is Saint 3 and that all games are played in the Jade room.
- Values between completely identical models converge to -18.75.
- Higher is better.
- This obeys the Student's t-distribution with n - 1 degrees of freedom, where n is equal to (# of games) × (# of PoVs of the proposed model per game).
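As a rough sketch only, assuming for simplicity that the grading-point change in a game is determined by the proposed model's final placement alone, the per-game values of the grading-point metric above could be computed as below and then fed into the same mean, variance, and confidence-interval machinery sketched earlier. The actual point changes for Saint 3 in the Jade room are defined by Mahjong Soul's rules and are deliberately not hard-coded here; all names are hypothetical.

```python
from typing import Mapping, Sequence


def grading_point_deltas(placements: Sequence[int],
                         table: Mapping[int, float]) -> list[float]:
    """Per-game grading-point changes of the proposed model.

    `placements` holds the proposed model's final placement (1-4) in each of
    its PoVs; `table` maps a placement to the grading-point change defined by
    Mahjong Soul's rules for the assumed grade (Saint 3) and room (Jade).
    """
    for p in placements:
        if p not in (1, 2, 3, 4):
            raise ValueError(f"invalid placement: {p}")
    return [table[p] for p in placements]
```

The mean of these per-game deltas is the reported value of the metric; under the simplification above, every placement is equally likely between completely identical models, so the mean converges to the average of the four table entries (−18.75 according to the list above).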
This metric is the expected increase or decrease in soul points of the proposed model per game, assuming that the grade of the proposed model is Celestial and that all games are played in the Throne room.
- Values between completely identical models converge to 0.
- Higher is better.
- This obeys the Student's t-distribution with n - 1 degrees of freedom, where n is equal to (# of games) × (# of PoVs of the proposed model per game).
This metric is the rate of games in which the proposed model takes the first place.
- Values between completely identical models converge to 0.25.
- Higher is better.
- The number of games in which the proposed model takes the first place obeys the binomial distribution, and its rate can be approximated very well by the normal distribution if the total number of games is reasonably large (see the sketch after this list).
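A minimal sketch of this normal-approximation (Wald) confidence interval for such a rate follows; the names and counts below are illustrative only.

```python
import math

from scipy.stats import norm


def rate_confidence_interval(successes: int, trials: int, level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a binomial rate,
    e.g. the top rate of the proposed model."""
    p = successes / trials
    z = norm.ppf((1.0 + level) / 2.0)
    half_width = z * math.sqrt(p * (1.0 - p) / trials)
    return p - half_width, p + half_width


# Illustrative counts: games in which the proposed model takes the first place, out of all games.
first_place_games, total_games = 260, 1000
print(rate_confidence_interval(first_place_games, total_games, 0.95))
print(rate_confidence_interval(first_place_games, total_games, 0.99))
```

The same interval applies unchanged to the quinella rate described next, with the success count being the number of games in which the proposed model finishes in first or second place.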
This metric is the rate of games in which the proposed model takes the first or second place. Note that quinella rate is called "連対率" in Japanese.
- Values between completely identical models converge to 0.5.
- Higher is better.
- The number of games where the proposed model takes the first or second place obeys the binomial distribution, and its rate can be approximated very well by the normal distribution if the total number of games is reasonably large.
This metric is a statistic about the per-game difference between the average ranking of the proposed model and the average ranking of the baseline model. For example, if the proposed model takes 2nd place in a 1vs3 style game, its average rank in that game is 2. The baseline model, taking the 1st, 3rd, and 4th places in this game, has an average rank of 8/3 (≒2.67). Thus, the metric for this game is 2 - 8/3 = -2/3 (≒-0.67).
- Values between completely identical models converge to 0.
- Lower is better.
- This statistic is computed from paired samples: the pairs are independent across games, but the two average ranks within each pair depend on each other. Thus, it is subject to a paired difference test, in which the per-game differences are treated as a single sample whose standardized mean obeys the Student's t-distribution with n - 1 degrees of freedom, where n is the number of games (see the sketch after this list).
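A rough sketch of such a paired difference test is given below, assuming per-game placements from 1vs3-style games; the values and names are illustrative only, and `scipy.stats.ttest_rel` is used here as one readily available implementation of the paired t-test.

```python
from scipy.stats import ttest_rel

# Placements of the proposed model in six 1vs3-style games (illustrative only).
proposed_ranks = [2, 1, 3, 4, 1, 2]

# In a 1vs3-style game the four placements sum to 1 + 2 + 3 + 4 = 10, so the
# baseline model's average rank in each game is (10 - proposed rank) / 3.
proposed_avg_ranks = [float(r) for r in proposed_ranks]
baseline_avg_ranks = [(10 - r) / 3 for r in proposed_ranks]

# Per-game differences; their mean converges to 0 between identical models.
differences = [p - b for p, b in zip(proposed_avg_ranks, baseline_avg_ranks)]
print(sum(differences) / len(differences))

# Paired t-test on the dependent per-game pairs.
result = ttest_rel(proposed_avg_ranks, baseline_avg_ranks)
print(result.statistic, result.pvalue)
```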