This is a simple comparison between FSRS and SM-17. The notebook for the comparison is FSRS-v-SM16-v-SM17.ipynb.
Because the SuperMemo and Anki workflows differ, it is not easy to compare the two algorithms. I tried to make the comparison as fair as possible. Here are some notes:
- The first interval in SuperMemo is the duration between creating the card and the first review. In Anki, the first interval is the duration between the first review and the second review. So I removed the first record of each card in SM-17 data.
- There are six grades in SuperMemo, but only four grades in Anki. So I merged 0, 1 and 2 in SuperMemo to 1 in Anki, and mapped 3, 4, and 5 in SuperMemo to 2, 3, and 4 in Anki.
- I use the R (SM17)(exp) recorded in sm18/systems/{collection_name}/stats/SM16-v-SM17.csv as the prediction of SM-17. Reference: Confusion among R(SM16), R(SM17)(exp), R(SM17), R est. and expFI.
- To ensure FSRS has the same information as SM-17, I implemented an online learning version of FSRS, in which FSRS, like SM-17, has zero knowledge of future reviews.
- The results are based on data from a small group of people. Results for other SuperMemo users may differ.
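The two preprocessing notes above (dropping each card's first record and merging grades) can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the `(card_id, sm_grade)` tuple layout and the names `SM_TO_ANKI` and `preprocess` are assumptions made for this example.

```python
# Hypothetical mapping from SuperMemo grades (0-5) to Anki ratings (1-4),
# following the rule in the notes: 0, 1, 2 -> 1; 3 -> 2; 4 -> 3; 5 -> 4.
SM_TO_ANKI = {0: 1, 1: 1, 2: 1, 3: 2, 4: 3, 5: 4}


def preprocess(reviews):
    """Drop each card's first record and convert grades to Anki ratings.

    `reviews` is a chronologically ordered list of (card_id, sm_grade)
    tuples. This layout is assumed for illustration only.
    """
    seen = set()
    result = []
    for card_id, sm_grade in reviews:
        if card_id not in seen:
            # First record of this card: skip it, because its interval is
            # measured from card creation rather than from a review.
            seen.add(card_id)
            continue
        result.append((card_id, SM_TO_ANKI[sm_grade]))
    return result


reviews = [("a", 4), ("a", 2), ("b", 5), ("a", 5), ("b", 3)]
processed = preprocess(reviews)
print(processed)  # [('a', 1), ('a', 4), ('b', 2)]
```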
We use two metrics in the FSRS benchmark to evaluate how well these algorithms work: log loss and a custom RMSE that we call RMSE (bins).
- Log Loss (also known as Binary Cross Entropy): A standard metric for binary classification. It measures the discrepancy between the predicted probabilities of recall and the actual review outcomes (1 or 0), and so quantifies how well the algorithm approximates the true recall probabilities.
- Weighted Root Mean Square Error in Bins (RMSE (bins)): A metric designed for the FSRS benchmark. Predictions and review outcomes are grouped into bins according to the predicted probabilities of recall. Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These squared differences are weighted by the sample size of each bin, and the square root of their weighted mean is the final value. This metric shows how well calibrated the predictions are across different probability ranges.
For both metrics, smaller is better. If you are unsure which metric to look at, look at RMSE (bins). That value can be interpreted as "the average difference between the predicted probability of recalling a card and the measured probability". For example, RMSE (bins)=0.05 means that the algorithm is, on average, off by 5 percentage points when predicting the probability of recall.
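The two metrics described above can be sketched in a few lines of Python. This is an illustration of the definitions rather than the benchmark's own implementation. In particular, the equal-width binning with `n_bins=20` is an assumption, and the benchmark's actual binning rule may differ.

```python
import math


def log_loss(predictions, outcomes, eps=1e-15):
    """Mean binary cross entropy between predicted recall and outcomes (0/1)."""
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(predictions)


def rmse_bins(predictions, outcomes, n_bins=20):
    """Weighted RMSE between mean prediction and mean recall rate per bin.

    Equal-width bins over the predicted probability are an assumption
    made for this sketch.
    """
    bins = {}
    for p, y in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        count, sum_p, sum_y = bins.get(idx, (0, 0.0, 0.0))
        bins[idx] = (count + 1, sum_p + p, sum_y + y)
    weighted_sq_error = 0.0
    for count, sum_p, sum_y in bins.values():
        # Squared gap between average prediction and average recall,
        # weighted by the number of reviews in the bin.
        weighted_sq_error += count * (sum_p / count - sum_y / count) ** 2
    return math.sqrt(weighted_sq_error / len(predictions))


predictions = [0.9, 0.9, 0.9, 0.9, 0.5, 0.5]
outcomes = [1, 1, 1, 0, 1, 0]
ll = log_loss(predictions, outcomes)
rmse = rmse_bins(predictions, outcomes)
print(f"log loss = {ll:.4f}, RMSE (bins) = {rmse:.4f}")
```

In this toy data, the bin around 0.9 has an observed recall rate of 0.75, so it contributes all of the error, while the bin around 0.5 is perfectly calibrated.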
Total users: 16
Total repetitions: 194,281
The following tables show the weighted means and the 99% confidence intervals.
Algorithm | Log Loss | RMSE(bins) |
---|---|---|
FSRS-4.5 | 0.37±0.088 | 0.06±0.023 |
FSRS-5 | 0.37±0.083 | 0.06±0.022 |
FSRSv4 | 0.38±0.087 | 0.06±0.024 |
FSRSv3 | 0.40±0.091 | 0.08±0.020 |
SM-17 | 0.41±0.097 | 0.08±0.020 |
SM-16 | 0.42±0.087 | 0.11±0.026 |
Algorithm | Log Loss | RMSE(bins) |
---|---|---|
FSRS-4.5 | 0.42±0.092 | 0.09±0.033 |
FSRS-5 | 0.42±0.075 | 0.09±0.031 |
SM-17 | 0.5±0.10 | 0.10±0.029 |
FSRSv4 | 0.44±0.082 | 0.10±0.042 |
FSRSv3 | 0.45±0.097 | 0.11±0.033 |
SM-16 | 0.5±0.11 | 0.12±0.033 |
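As a rough sketch of how a weighted mean and a confidence interval like those in the tables above can be computed from per-user results: the code below weights each user's metric by their number of reviews and uses a normal approximation for the 99% interval. Both choices are assumptions made for illustration. This document does not specify the weighting or the interval method, and the sample values are made up.

```python
import math


def weighted_mean_and_ci(values, weights, z=2.576):
    """Return (weighted mean, half-width of a normal-approximation CI).

    z=2.576 is the two-sided 99% quantile of the standard normal.
    The interval method is an assumption, not the benchmark's method.
    """
    total_w = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total_w
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total_w
    half_width = z * math.sqrt(variance / len(values))
    return mean, half_width


# Hypothetical per-user RMSE (bins) values and review counts.
user_rmse = [0.05, 0.07, 0.10]
user_reviews = [10000, 5000, 1000]
mean, half_width = weighted_mean_and_ci(user_rmse, user_reviews)
print(f"{mean:.3f}±{half_width:.3f}")
```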
The image below shows the p-values obtained by running the Wilcoxon signed-rank test on the RMSE (bins) of all pairs of algorithms. Red means that the row algorithm performs worse than the corresponding column algorithm, and green means that the row algorithm performs better than the corresponding column algorithm. Grey means that the p-value is >0.05, and we cannot conclude that one algorithm performs better than the other.
It's worth mentioning that this test is not weighted, and therefore doesn't take into account that RMSE (bins) depends on the number of reviews.
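For readers unfamiliar with the test, here is a minimal pure-Python sketch of an exact two-sided Wilcoxon signed-rank test on paired per-user values. It is not the notebook's code, and the sample values are made up. In practice one would more likely call `scipy.stats.wilcoxon`. Note that every user counts equally, which is the unweighted behaviour mentioned above.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        # Tied positions i..j (0-based) share the average rank
        # ((i + 1) + (j + 1)) / 2; store twice that so ranks stay integers.
        for k in range(i, j + 1):
            doubled_ranks[order[k]] = i + j + 2
        i = j + 1
    w_plus = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    total = sum(doubled_ranks)
    # counts[s] = number of sign assignments whose positive rank sum is s
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled_ranks:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    w_min = min(w_plus, total - w_plus)
    p = 2 * sum(counts[: w_min + 1]) / 2 ** n
    return min(1.0, p)


# Hypothetical per-user RMSE (bins) for two algorithms, paired by user.
rmse_a = [0.05, 0.06, 0.04, 0.07, 0.05, 0.06]
rmse_b = [0.08, 0.09, 0.05, 0.11, 0.07, 0.10]
p_value = wilcoxon_signed_rank(rmse_a, rmse_b)
print(p_value)  # 0.03125
```

Here algorithm A is better for all six users, so the p-value is the smallest one possible with six pairs (2/64) and falls below the 0.05 threshold.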
If you would like to support this project, please consider sharing your data with us. The shared data will be stored in the ./dataset/ folder.
You can open an issue to submit it: https://github.com/open-spaced-repetition/fsrs-vs-sm17/issues/new/choose
- leee_ 🔣
- Jarrett Ye 🔣
- 天空守望者 🔣
- reallyyy 🔣
- shisuu 🔣
- Winston 🔣
- Spade7 🔣
- John Qing 🔣
- WolfSlytherin 🔣
- HyFran 🔣
- Hansel221 🔣
- 曾经沧海难为水 🔣
- Pariance 🔣
- github-gracefeng 🔣