TODO list #137

Open
Expertium opened this issue Dec 10, 2024 · 21 comments

Comments

@Expertium
Contributor

Expertium commented Dec 10, 2024

This is just to keep track of stuff

  1. Add sibling information in a way that FSRS can work with. I need 3 modifications of Anki 10k #136 (comment), https://discord.com/channels/368267295601983490/1282005522513530952/1320698604771348570 Done ✅
  2. During testing, filter out same-day reviews using delta_t in days, but keep delta_t in seconds for the calculations (see the sketch after this list). https://forums.ankiweb.net/t/due-column-changing-days-from-whole-numbers-to-decimals-in-scheduling/52213/53?u=expertium Done ✅
  3. Benchmark obezag's idea. https://discord.com/channels/368267295601983490/1282005522513530952/1315873110171451422
  4. Fine-tune the formula for interpolating missing S0: https://discord.com/channels/368267295601983490/1282005522513530952/1319714323680989256
  5. Benchmark updating D before S Done ✅
  6. Benchmark setting weights of outliers to 0: https://discord.com/channels/368267295601983490/1282005522513530952/1320670294188363778
  7. Benchmark FSRS v1 and FSRS v2 Done ✅
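
A minimal sketch of item 2 (same-day filtering), assuming a review table with the elapsed_days and elapsed_seconds columns used elsewhere in this thread; evaluate_same_day_filtered() and model.predict() are placeholders, not the benchmark's actual API:

import pandas as pd

def evaluate_same_day_filtered(df: pd.DataFrame, model) -> pd.DataFrame:
    df = df.copy()
    # the model still sees second-resolution intervals, expressed in fractional days
    df["delta_t"] = df["elapsed_seconds"] / 86400.0
    # ...but same-day reviews are excluded from the test set using whole-day delta_t
    test = df[df["elapsed_days"] >= 1].copy()
    test["p"] = model.predict(test["delta_t"])  # placeholder call
    return test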
@L-M-Sherlock
Member

What's "Fine-tune the formula for interpolating missing S0"?

@Expertium
Contributor Author

Check the Discord message, I attached a link

  1. In the 10k dataset, find users who haven't used one or two buttons during the first review, but started doing so in the second half of their review history
  2. Estimate the missing S0 https://github.com/open-spaced-repetition/fsrs-optimizer/blob/9b9b700ea463a2505f28d8c04717d9bd34787d5e/src/fsrs_optimizer/fsrs_optimizer.py#L1107 based on the first halves of their review histories
  3. Run the optimizer on the second halves, don't calculate S0 normally, fill in S0 using the previous estimates. Example: a user never used Good. You use your wacky formula (that I don't understand) to estimate S0. Then, for the second half of the review history, one where the user does use Good, use your estimate from the previous half
  4. Get logloss/RMSE values, tweak w1 and w2 in your wacky formula
  5. Repeat steps 2-4 many times until you find the best combination of w1 and w2
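
A rough sketch of this loop for illustration, with estimate_s0(), fill_missing_s0() (the interpolation formula with w1 and w2) and run_optimizer() as hypothetical stand-ins for the real optimizer routines:

import numpy as np

def evaluate_w1_w2(users, w1, w2):
    # users: list of per-user review dataframes that meet the criteria above
    rmses = []
    for df in users:
        first, second = df.iloc[:len(df) // 2], df.iloc[len(df) // 2:]
        s0 = estimate_s0(first)                 # partial S0, some ratings missing
        s0 = fill_missing_s0(w1, w2, s0)        # formula being fine-tuned (steps 2-3)
        rmses.append(run_optimizer(second, fixed_s0=s0))  # RMSE with S0 pinned
    return np.mean(rmses)

# steps 4-5: sweep w1 and w2 and keep the best pair
grid = [(w1, w2) for w1 in np.arange(0.1, 1.0, 0.1) for w2 in np.arange(0.1, 1.0, 0.1)]
best_w1, best_w2 = min(grid, key=lambda pair: evaluate_w1_w2(users, *pair))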

@Expertium
Contributor Author

Also, don't forget to benchmark FSRS v1 and FSRS v2. I'll add that to the list

@L-M-Sherlock
Member

Check the Discord message, I attached a link

I checked, but I don't get the details. Could you elaborate on the idea here?

@Expertium
Contributor Author

Expertium commented Dec 26, 2024

You have w1 and w2, which are currently both set to 3/5. The idea is to find users who did not use one or two grades for their first reviews during the first half of their review history but started using those grades later, calculate the missing S0 values from that first half, and check how well they fit.
For example, say a user has never pressed Good for a first review during the first half of his review history. You use that first half to calculate the missing S0(Good). Then you use that S0(Good) during optimization on the second half of his review history, and check RMSE.
Then you do that for all users who meet these criteria (didn't use one or two grades for their first reviews in the first half, but used them in the second half) and for different values of w1 and w2.
If you have a better idea how to fine-tune w1 and w2, feel free to do that.

@L-M-Sherlock
Member

L-M-Sherlock commented Dec 30, 2024

Only these collections match your conditions:

image

@Expertium
Contributor Author

Oh well. Forget about it then

@Expertium
Contributor Author

Expertium commented Dec 30, 2024

Oh, wait, @L-M-Sherlock
image
The first condition should be df_first_half["rating"].nunique()==2 or df_first_half["rating"].nunique()==3, not just df_first_half["rating"].nunique()==2, since w1 and w2 are used for interpolating S0 values both when one value is missing and when two values are missing.

Also, are you sure you are checking the right thing? We're not looking for users who never used a certain answer button, we're looking for users who didn't use a certain answer button for their first reviews. For example, if someone never used "Good" during their first review, but used it during their second/third/nth review, then he should count, and you should record his ID.
So you need something like this: df = df.loc[(df['elapsed_seconds'] == -1) & (df['elapsed_days'] == -1)], so that you only look at first reviews.
(I have "undone" item 4 on the todo list)

@L-M-Sherlock
Member

import pandas as pd

def process(user_id):
    # DATA_PATH points at the 10k review dataset (defined elsewhere)
    df = pd.read_parquet(DATA_PATH, filters=[("user_id", "==", user_id)])
    df["review_th"] = range(1, df.shape[0] + 1)  # chronological review index
    df.sort_values(by=["card_id", "review_th"], inplace=True)
    df.drop(df[df["elapsed_days"] == 0].index, inplace=True)  # drop same-day reviews
    df["i"] = df.groupby("card_id").cumcount() + 1  # per-card review number
    df["y"] = df["rating"].map(lambda x: {1: 0, 2: 1, 3: 1, 4: 1}[x])  # recall label
    # keep only the second review of each card, i.e. the outcome of the first interval
    df = df[(df["elapsed_days"] > 0) & (df["i"] == 2)].sort_values(by=["review_th"])
    length = len(df)
    df_first_half = df.iloc[:length // 2]
    df_second_half = df.iloc[length // 2:]
    if df_first_half["rating"].nunique() == 2 and df_second_half["rating"].nunique() == 4:
        print(user_id)
    return user_id

So you need something like this: df = df.loc[(df['elapsed_seconds'] == -1) & (df['elapsed_days'] == -1)], so that you only look at first reviews.

The df["i"] == 2 plays the same role here.

@Expertium
Contributor Author

Ok, but you haven't done this

The first condition should be df_first_half["rating"].nunique()==2 or df_first_half["rating"].nunique()==3, not just df_first_half["rating"].nunique()==2, since w1 and w2 are used for interpolating S0 values both when one value is missing and when two values are missing.

@L-M-Sherlock
Member

image

@Expertium
Contributor Author

That's not what I said, though. I meant something like this:

if (df_first_half["rating"].nunique()==2 or df_first_half["rating"].nunique()==3) and df_second_half["rating"].nunique()==4:
    print(user_id)

@L-M-Sherlock
Member

image

Now we have ~300 collections.

@Expertium
Contributor Author

Nice. Now I want you to do what I described above:

  1. Using df_first_half estimate S0. Use your formula with w1 and w2 to fill in missing values.
  2. Run the optimizer on df_second_half, and use S0 values from the previous step.
  3. Do steps 1-2 for each user, record average RMSE.
  4. Change w1 and w2 and repeat steps 1-3 until you find good w1 and w2.

The key idea is that by using S0 from the first half we can check how well it fits the second half.

@L-M-Sherlock
Member

I think it's really hard to evaluate S0 with so little data:

(3 images)

@Expertium
Contributor Author

Man...
Alright, forget about it then

@Expertium
Contributor Author

Expertium commented Jan 2, 2025

@L-M-Sherlock I have a better idea

  1. Find all users who use all 4 buttons during the first review AND each button is used at least 200 times
    So if you display button counts (for the first review only) like this:
    Again: x1, Hard: x2, Good: x3, Easy: x4
    Then each x must be >=200
  2. Send me a .jsonl or .csv file with S0 values of each such user. If there are 5000 users like this, then the .jsonl file should have 5000 lines

I'll fine-tune w1 and w2 by removing 1-2 S0 values and filling them back in using the formula with w1 and w2, and then minimizing MAPE
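
For illustration, a sketch of that selection, reusing DATA_PATH and the first-review filter from earlier in this thread; export_s0() and fit_s0() are hypothetical stand-ins for whatever actually produces the four initial stabilities:

import json
import pandas as pd

def export_s0(user_ids, out_path="s0_values.jsonl"):
    with open(out_path, "w") as f:
        for user_id in user_ids:
            df = pd.read_parquet(DATA_PATH, filters=[("user_id", "==", user_id)])
            # first reviews only
            first = df[(df["elapsed_seconds"] == -1) & (df["elapsed_days"] == -1)]
            counts = first["rating"].value_counts()
            # all 4 buttons used, each at least 200 times
            if all(counts.get(r, 0) >= 200 for r in (1, 2, 3, 4)):
                s0 = fit_s0(df)  # placeholder: returns {1: S0, 2: S0, 3: S0, 4: S0}
                f.write(json.dumps({"user_id": user_id, "s0": s0}) + "\n")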

@Expertium
Contributor Author

Expertium commented Jan 4, 2025

Alright, I did it myself.
I found 2642 users who have used each button during their first reviews AND each button was pressed at least 200 times.
Then for each list of initial stabilities I removed one stability (for Again, then for Hard, then for Good, then for Easy). That gives me 2642*4=10568 datapoints. Then I also removed two stabilities (Again and Hard, Again and Good, etc.). After doing both I ended up with 26420 datapoints aka dictionaries where either one or two S0 values are missing.
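
A sketch of how those datapoints can be built from the 2642 complete S0 dictionaries (4 single-removal masks plus 6 pair-removal masks per user, 2642 × 10 = 26420); s0_dicts is assumed to be the list of complete {rating: stability} dictionaries:

from itertools import combinations

x_list, y_list = [], []  # x: dicts with 1-2 S0 values removed, y: the complete dicts
for s0 in s0_dicts:
    for k in (1, 2):  # remove one value, then two values
        for missing in combinations([1, 2, 3, 4], k):
            x_list.append({r: s for r, s in s0.items() if r not in missing})
            y_list.append(s0)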
Then I used a Bayesian optimizer (Ax) that I normally use for fine-tuning neural network hyperparameters, but it can be used with pretty much anything:

from ax.service.ax_client import AxClient

# search space for the two interpolation weights
parameters = [
    {"name": "w1", "type": "range", "bounds": [0.2, 1.8], "log_scale": False, "value_type": 'float'},
    {"name": "w2", "type": "range", "bounds": [0.2, 1.8], "log_scale": False, "value_type": 'float'}
]
ax = AxClient(random_seed=42)
ax.create_experiment(name="S0 Interpolation", parameters=parameters)
trials = 250
for i in range(trials):
    print(f"Starting trial {i + 1}/{trials}")
    parameters, trial_index = ax.get_next_trial()  # suggested {"w1": ..., "w2": ...}
    ax.complete_trial(trial_index=trial_index, raw_data=interpolate(parameters))  # report negated MAPE

interpolate() is a function that takes the trial's parameters (w1 and w2), goes over the datapoints with missing S0 values, fills them in, and calculates MAPE like this:

    def interpolate(params):
        w1, w2 = params["w1"], params["w2"]
        errors = []
        # x_list contains dictionaries with missing S0, y_list the complete ones
        for x, y in zip(x_list, y_list):
            interpolated = f_interpolate(w1, w2, x.copy())  # fill in the missing values
            for rating in [1, 2, 3, 4]:
                if rating not in x and rating in interpolated:
                    error = np.abs(y[rating] - interpolated[rating]) / y[rating]
                    errors.append(error)
        return -min(np.nanmean(errors), 1e50)

It returns -min(np.nanmean(errors), 1e50). The minus sign is because I couldn't figure out how to make Ax do minimization instead of maximization (the documentation says there is a keyword for that, but nope), and nanmean prevents the optimization from stopping if there are NaNs.

So here's what I get with the default w1=3/5, w2=3/5:
MAPE(0.60, 0.60)=86810.1%

And here's after the Bayesian optimization with w1=1.35, w2=0.68:
MAPE(1.35, 0.68)=244.2%

355 times more accurate! And yes, I checked that it doesn't result in nans.

Expertium added a commit to Expertium/srs-benchmark that referenced this issue Jan 4, 2025
@L-M-Sherlock
Member

L-M-Sherlock commented Jan 5, 2025

What if the stability of again is 1 day and the stability of hard is 2 days? Could you give me the interpolated results of good and easy? IIRC, both w1 and w2 should fall within the range of 0 to 1 to avoid weird output.

@Expertium
Contributor Author

Expertium commented Jan 5, 2025

What if the stability of again is 1 day and the stability of hard is 2 days? Could you give me the interpolated results of good and easy?

{1: 1, 2: 2, 3: 0.13801118920922661, 4: 0.03921993874306448}

Yeah, it's not monotonic. Perhaps we could make a better interpolation formula, but idk how.
Hold on, let me set the ranges to [0.1, 0.9] and see what I get via Bayesian optimization.

Ok, here's what it found
MAPE(0.41, 0.54)=375.4%
Worse than my last values, but better than the default ones.
Interpolation results: {1: 1, 2: 2, 3: 3.2375786810669958, 4: 4.879996457283174}
w1=0.41 and w2=0.54 looks good!
It's interesting that a relatively small change in w1 and w2 can reduce MAPE by more than 100 times.

L-M-Sherlock pushed a commit that referenced this issue Jan 5, 2025
* Fine-tuned w1 and w2

See #137 (comment)

* New values to avoid weirdness
@L-M-Sherlock
Member

L-M-Sherlock commented Jan 12, 2025

We have trouble... The L2 regularization doesn't work well with recency weighting:

Model: FSRS-5-recency-dev
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-recency-dev LogLoss (mean±std): 0.3256±0.1518
FSRS-5-recency-dev RMSE(bins) (mean±std): 0.0495±0.0320
FSRS-5-recency-dev AUC (mean±std): 0.7055±0.0754

Weighted average by log(reviews):
FSRS-5-recency-dev LogLoss (mean±std): 0.3508±0.1682
FSRS-5-recency-dev RMSE(bins) (mean±std): 0.0691±0.0445
FSRS-5-recency-dev AUC (mean±std): 0.7045±0.0855

Weighted average by users:
FSRS-5-recency-dev LogLoss (mean±std): 0.3541±0.1706
FSRS-5-recency-dev RMSE(bins) (mean±std): 0.0720±0.0462
FSRS-5-recency-dev AUC (mean±std): 0.7037±0.0876

parameters: [0.4226, 1.3065, 3.2448, 15.8678, 7.1831, 0.5456, 1.6395, 0.0057, 1.5183, 0.119, 1.0057, 1.9383, 0.108, 0.2961, 2.2722, 0.2305, 2.9898, 0.4832, 0.6638]

Model: FSRS-5-recency
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-recency LogLoss (mean±std): 0.3260±0.1520
FSRS-5-recency RMSE(bins) (mean±std): 0.0493±0.0322
FSRS-5-recency AUC (mean±std): 0.7041±0.0765

Weighted average by log(reviews):
FSRS-5-recency LogLoss (mean±std): 0.3519±0.1691
FSRS-5-recency RMSE(bins) (mean±std): 0.0689±0.0453
FSRS-5-recency AUC (mean±std): 0.7015±0.0879

Weighted average by users:
FSRS-5-recency LogLoss (mean±std): 0.3553±0.1719
FSRS-5-recency RMSE(bins) (mean±std): 0.0718±0.0473
FSRS-5-recency AUC (mean±std): 0.7005±0.0899

parameters: [0.4314, 1.1681, 3.2702, 15.8593, 7.1329, 0.5336, 1.7704, 0.0108, 1.5127, 0.1313, 1.004, 1.9192, 0.1034, 0.306, 2.3389, 0.2307, 3.0355, 0.4536, 0.6491]

Edit: I get it. The recency weighting decreases the average loss, so the relative penalty of L2 regularization increases. I will decrease the gamma and re-benchmark it.
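
A toy illustration of that interaction, assuming a loss of the form weighted_mean(per-review loss) + gamma · L2 penalty; the recency ramp and the numbers here are made up, only the effect is the point:

import numpy as np

rng = np.random.default_rng(0)
losses = rng.uniform(0.2, 0.5, size=10_000)   # toy per-review log losses
ramp = np.linspace(0.25, 1.0, losses.size)    # assumed recency ramp (older -> newer)
penalty = 1.0 * 0.02                          # toy gamma * ||w - w_init||^2

unweighted_total = losses.mean() + penalty
recency_total = (ramp * losses).mean() + penalty

# The data-fit term shrinks under recency weighting while the L2 term stays fixed,
# so the regularizer's relative share of the loss grows -- hence decreasing gamma.
print(penalty / unweighted_total, penalty / recency_total)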
