open-spaced-repetition/fsrs-vs-sm17

FSRS vs SM-17

This is a simple comparison between FSRS and SM-17. The notebook FSRS-v-SM16-v-SM17.ipynb contains the comparison.

Because the workflows of SuperMemo and Anki differ, it is not easy to compare the two algorithms. I tried to make the comparison as fair as possible. Here are some notes:

  • The first interval in SuperMemo is the duration between creating the card and the first review. In Anki, the first interval is the duration between the first review and the second review. So I removed the first record of each card in SM-17 data.
  • There are six grades in SuperMemo, but only four grades in Anki. So I merged grades 0, 1, and 2 in SuperMemo into grade 1 in Anki, and mapped grades 3, 4, and 5 in SuperMemo to grades 2, 3, and 4 in Anki.
  • I use the R (SM17)(exp) recorded in sm18/systems/{collection_name}/stats/SM16-v-SM17.csv as the prediction of SM-17. Reference: Confusion among R(SM16), R(SM17)(exp), R(SM17), R est. and expFI.
  • To ensure FSRS has the same information as SM-17, I implemented an online learning version of FSRS. Like SM-17, it has zero knowledge of future reviews when it makes a prediction.
  • The results are based on data from a small group of people. They may differ from the results for other SuperMemo users.
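The first two notes above can be sketched as a small preprocessing step. This is only an illustration, not the notebook's actual code: the record layout `(card_id, sm_grade)` and the function names are hypothetical, and the reviews are assumed to be in chronological order.

```python
# Hypothetical record layout: (card_id, sm_grade), in chronological order.
# Grade mapping described in the notes: SuperMemo 0-2 -> Anki 1, 3 -> 2, 4 -> 3, 5 -> 4.
SM_TO_ANKI_GRADE = {0: 1, 1: 1, 2: 1, 3: 2, 4: 3, 5: 4}


def drop_first_record(reviews):
    """Remove the first record of each card, keeping the order of the rest."""
    seen = set()
    kept = []
    for card_id, grade in reviews:
        if card_id not in seen:
            seen.add(card_id)  # first record of this card: skip it
            continue
        kept.append((card_id, grade))
    return kept


def preprocess(reviews):
    """Drop each card's first record, then convert grades to the Anki scale."""
    return [(card_id, SM_TO_ANKI_GRADE[grade])
            for card_id, grade in drop_first_record(reviews)]


# Made-up example data, for illustration only.
raw = [("a", 5), ("a", 2), ("b", 3), ("a", 4), ("b", 0)]
clean = preprocess(raw)
print(clean)
```

Using a lookup table keeps the grade merge explicit, so an unexpected grade raises a `KeyError` instead of being converted silently.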

Metrics

We use two metrics in the FSRS benchmark to evaluate how well these algorithms work: log loss and a custom RMSE that we call RMSE (bins).

  • Log Loss (also known as Binary Cross Entropy): Utilized primarily for its applicability in binary classification problems, log loss serves as a measure of the discrepancies between predicted probabilities of recall and review outcomes (1 or 0). It quantifies how well the algorithm approximates the true recall probabilities, making it an important metric for model evaluation in spaced repetition systems.
  • Weighted Root Mean Square Error in Bins (RMSE (bins)): This is a metric engineered for the FSRS benchmark. In this approach, predictions and review outcomes are grouped into bins according to the predicted probabilities of recall. Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These values are then weighted according to the sample size in each bin, and then the final weighted root mean square error is calculated. This metric provides a nuanced understanding of model performance across different probability ranges.
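As a rough illustration of the two metrics, here is a minimal standard-library sketch. The 20 equal-width bins, the clipping constant `eps`, and the function names are assumptions made for this example; the binning used by the actual benchmark may differ.

```python
import math
from collections import defaultdict


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross entropy between outcomes (1 or 0) and predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def rmse_bins(y_true, p_pred, n_bins=20):
    """Weighted RMSE between mean prediction and mean recall rate per bin."""
    # Each bin stores [sum of predictions, sum of outcomes, sample count].
    bins = defaultdict(lambda: [0.0, 0.0, 0])
    for y, p in zip(y_true, p_pred):
        idx = min(int(p * n_bins), n_bins - 1)  # equal-width bins (assumption)
        bins[idx][0] += p
        bins[idx][1] += y
        bins[idx][2] += 1
    total = sum(count for _, _, count in bins.values())
    weighted_sq = sum(
        count * (p_sum / count - y_sum / count) ** 2
        for p_sum, y_sum, count in bins.values()
    )
    return math.sqrt(weighted_sq / total)


# Made-up outcomes and predictions, for illustration only.
outcomes = [1, 0, 1, 1]
predictions = [0.9, 0.1, 0.9, 0.9]
print(log_loss(outcomes, predictions), rmse_bins(outcomes, predictions))
```

In this toy example, the bin around 0.9 has a recall rate of 1.0 and the bin around 0.1 has a recall rate of 0.0, so each bin is off by 0.1 and the weighted result is about 0.1.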

For both metrics, smaller is better. If you are unsure which metric to look at, look at RMSE (bins). That value can be interpreted as "the average difference between the predicted probability of recalling a card and the measured probability". For example, RMSE (bins) = 0.05 means that the algorithm is, on average, off by 5 percentage points when predicting the probability of recall.

Result

Total users: 16

Total repetitions: 194,281

The following tables show the weighted means and the 99% confidence intervals.

Weighted by number of repetitions

| Algorithm | Log Loss | RMSE (bins) |
| --- | --- | --- |
| FSRS-4.5 | 0.37±0.088 | 0.06±0.023 |
| FSRS-5 | 0.37±0.083 | 0.06±0.022 |
| FSRSv4 | 0.38±0.087 | 0.06±0.024 |
| FSRSv3 | 0.40±0.091 | 0.08±0.020 |
| SM-17 | 0.41±0.097 | 0.08±0.020 |
| SM-16 | 0.42±0.087 | 0.11±0.026 |

Weighted by ln(number of repetitions)

| Algorithm | Log Loss | RMSE (bins) |
| --- | --- | --- |
| FSRS-4.5 | 0.42±0.092 | 0.09±0.033 |
| FSRS-5 | 0.42±0.075 | 0.09±0.031 |
| SM-17 | 0.5±0.10 | 0.10±0.029 |
| FSRSv4 | 0.44±0.082 | 0.10±0.042 |
| FSRSv3 | 0.45±0.097 | 0.11±0.033 |
| SM-16 | 0.5±0.11 | 0.12±0.033 |
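To show how the two weighting schemes can lead to different means, here is a minimal sketch with made-up per-collection values. The numbers and names are hypothetical. The confidence interval is left out because this document does not state how it is computed.

```python
import math


def weighted_mean(values, weights):
    """Weighted arithmetic mean of per-collection metric values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical per-collection RMSE (bins) values and repetition counts.
rmse_per_collection = [0.05, 0.10, 0.20]
n_repetitions = [10_000, 1_000, 100]

# Weighted by number of repetitions: the largest collection dominates.
mean_by_n = weighted_mean(rmse_per_collection, n_repetitions)

# Weighted by ln(number of repetitions): small collections count for more.
mean_by_ln_n = weighted_mean(
    rmse_per_collection, [math.log(n) for n in n_repetitions]
)

print(round(mean_by_n, 4), round(mean_by_ln_n, 4))
```

With these made-up numbers the mean weighted by repetitions is about 0.056, while the ln-weighted mean is 0.1, because the small collections with larger errors carry relatively more weight.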

The image below shows the p-values obtained by running the Wilcoxon signed-rank test on the RMSE (bins) of all pairs of algorithms. Red means that the row algorithm performs worse than the corresponding column algorithm, and green means that the row algorithm performs better than the corresponding column algorithm. Grey means that the p-value is >0.05, and we cannot conclude that one algorithm performs better than the other.

It's worth mentioning that this test is not weighted, and therefore doesn't take into account that RMSE (bins) depends on the number of reviews.
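For readers unfamiliar with the test, here is a standard-library sketch of a two-sided exact Wilcoxon signed-rank test on paired per-collection values. It is not the notebook's code (a library routine such as `scipy.stats.wilcoxon` would normally be used), the function name is hypothetical, and the sample values are made up. Zero differences are discarded and tied magnitudes get average ranks.

```python
def wilcoxon_signed_rank(x, y):
    """Return (statistic, two-sided exact p-value) for paired samples x, y.

    The statistic is min(W+, W-), the smaller of the positive and negative
    rank sums of the paired differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order. Ranks are stored doubled so that average
    # ranks of ties (multiples of 0.5) remain integers.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # 2 * average of ranks i+1 .. j+1
        i = j + 1

    total2 = sum(ranks2)
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    stat2 = min(w_plus2, total2 - w_plus2)

    # counts[s] = number of sign assignments whose doubled positive-rank
    # sum equals s; under the null hypothesis all 2**n are equally likely.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]

    p_lower = sum(counts[: stat2 + 1]) / 2 ** n
    return stat2 / 2, min(1.0, 2 * p_lower)


# Hypothetical RMSE (bins) of two algorithms on the same six collections.
algo_a = [0.05, 0.07, 0.04, 0.09, 0.06, 0.08]
algo_b = [0.08, 0.09, 0.05, 0.13, 0.12, 0.15]
statistic, p_value = wilcoxon_signed_rank(algo_a, algo_b)
print(statistic, p_value)
```

In this made-up example the first algorithm has the lower error on every collection, so the statistic is 0 and the p-value is 2/2^6 = 0.03125. Every collection counts equally in this test, regardless of how many reviews it contains.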

Figure: Wilcoxon-16-collections (p-values of the pairwise Wilcoxon signed-rank tests)

Share your data

If you would like to support this project, please consider sharing your data with us. The shared data will be stored in the ./dataset/ folder.

You can open an issue to submit it: https://github.com/open-spaced-repetition/fsrs-vs-sm17/issues/new/choose

Contributors

  • leee_ 🔣
  • Jarrett Ye 🔣
  • 天空守望者 🔣
  • reallyyy 🔣
  • shisuu 🔣
  • Winston 🔣
  • Spade7 🔣
  • John Qing 🔣
  • WolfSlytherin 🔣
  • HyFran 🔣
  • Hansel221 🔣
  • 曾经沧海难为水 🔣
  • Pariance 🔣
  • github-gracefeng 🔣