TODO list #137

Open
Expertium opened this issue Dec 10, 2024 · 21 comments

Comments

@Expertium
Contributor

Expertium commented Dec 10, 2024

This is just to keep track of stuff

  1. Add sibling information in a way that FSRS can work with. I need 3 modifications of Anki 10k #136 (comment), https://discord.com/channels/368267295601983490/1282005522513530952/1320698604771348570 Done ✅
  2. During testing, filter out same-day reviews using delta_t in days, but keep delta_t in seconds for the calculations (see the sketch after this list). https://forums.ankiweb.net/t/due-column-changing-days-from-whole-numbers-to-decimals-in-scheduling/52213/53?u=expertium Done ✅
  3. Benchmark obezag's idea. https://discord.com/channels/368267295601983490/1282005522513530952/1315873110171451422
  4. Fine-tune the formula for interpolating missing S0: https://discord.com/channels/368267295601983490/1282005522513530952/1319714323680989256
  5. Benchmark updating D before S Done ✅
  6. Benchmark setting weights of outliers to 0: https://discord.com/channels/368267295601983490/1282005522513530952/1320670294188363778
  7. Benchmark FSRS v1 and FSRS v2 Done ✅
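
A minimal sketch of item 2 (same-day filtering), assuming a review table with the elapsed_days and elapsed_seconds columns used elsewhere in this thread; evaluate_same_day_filtered() and model.predict() are placeholders, not the benchmark's actual API:

import pandas as pd

def evaluate_same_day_filtered(df: pd.DataFrame, model) -> pd.DataFrame:
    df = df.copy()
    # the model still sees second-resolution intervals, expressed in fractional days
    df["delta_t"] = df["elapsed_seconds"] / 86400.0
    # ...but same-day reviews are excluded from the test set using whole-day delta_t
    test = df[df["elapsed_days"] >= 1].copy()
    test["p"] = model.predict(test["delta_t"])  # placeholder call
    return test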
@L-M-Sherlock
Member

What's "Fine-tune the formula for interpolating missing S0"?

@Expertium
Contributor Author

Check the Discord message, I attached a link

  1. In the 10k dataset, find users who haven't used one or two buttons during the first review, but started doing so in the second half of their review history
  2. Estimate the missing S0 https://github.com/open-spaced-repetition/fsrs-optimizer/blob/9b9b700ea463a2505f28d8c04717d9bd34787d5e/src/fsrs_optimizer/fsrs_optimizer.py#L1107 based on the first halves of their review histories
  3. Run the optimizer on the second halves, don't calculate S0 normally, fill in S0 using the previous estimates. Example: a user never used Good. You use your wacky formula (that I don't understand) to estimate S0. Then, for the second half of the review history, one where the user does use Good, use your estimate from the previous half
  4. Get logloss/RMSE values, tweak w1 and w2 in your wacky formula
  5. Repeat steps 2-4 many times until you find the best combination of w1 and w2
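
A rough sketch of this loop for illustration, with estimate_s0(), fill_missing_s0() (the interpolation formula with w1 and w2) and run_optimizer() as hypothetical stand-ins for the real optimizer routines:

import numpy as np

def evaluate_w1_w2(users, w1, w2):
    # users: list of per-user review dataframes that meet the criteria above
    rmses = []
    for df in users:
        first, second = df.iloc[:len(df) // 2], df.iloc[len(df) // 2:]
        s0 = estimate_s0(first)                 # partial S0, some ratings missing
        s0 = fill_missing_s0(w1, w2, s0)        # formula being fine-tuned (steps 2-3)
        rmses.append(run_optimizer(second, fixed_s0=s0))  # RMSE with S0 pinned
    return np.mean(rmses)

# steps 4-5: sweep w1 and w2 and keep the best pair
grid = [(w1, w2) for w1 in np.arange(0.1, 1.0, 0.1) for w2 in np.arange(0.1, 1.0, 0.1)]
best_w1, best_w2 = min(grid, key=lambda pair: evaluate_w1_w2(users, *pair))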

@Expertium
Contributor Author

Also, don't forget to benchmark FSRS v1 and FSRS v2. I'll add that to the list

@L-M-Sherlock
Member

Check the Discord message, I attached a link

I checked, but I don't get the details. Could you elaborate on the idea here?

@Expertium
Contributor Author

Expertium commented Dec 26, 2024

You have w1 and w2, which are currently both set to 3/5. The idea is to find users who did not use one or two grades for their first reviews during the first half of their review history but started using those grades later, calculate the missing S0 values from that first half, and check how well they fit.
For example, say a user has never pressed Good for a first review during the first half of his review history. You use that first half to calculate the missing S0(Good). Then you use that S0(Good) during optimization on the second half of his review history, and check RMSE.
Then you do that for all users who meet these criteria (didn't use one or two grades for their first reviews in the first half, but used them in the second half) and for different values of w1 and w2.
If you have a better idea how to fine-tune w1 and w2, feel free to do that.

@L-M-Sherlock
Member

L-M-Sherlock commented Dec 30, 2024

Only these collections match your conditions:

image

@Expertium
Contributor Author

Oh well. Forget about it then

@Expertium
Contributor Author

Expertium commented Dec 30, 2024

Oh, wait, @L-M-Sherlock
image
The first condition should be df_first_half["rating"].nunique()==2 or df_first_half["rating"].nunique()==3, not just df_first_half["rating"].nunique()==2, since w1 and w2 are used for interpolating S0 values both when one value is missing and when two values are missing.

Also, are you sure you are checking the right thing? We're not looking for users who never used a certain answer button, we're looking for users who didn't use a certain answer button for their first reviews. For example, if someone never used "Good" during their first review, but used it during their second/third/nth review, then he should count, and you should record his ID.
So you need something like this: df = df.loc[(df['elapsed_seconds'] == -1) & (df['elapsed_days'] == -1)], so that you only look at first reviews.
(I have "undone" item 4 on the todo list)

@L-M-Sherlock
Member

import pandas as pd

def process(user_id):
    # DATA_PATH points at the 10k review dataset (defined elsewhere)
    df = pd.read_parquet(DATA_PATH, filters=[("user_id", "==", user_id)])
    df["review_th"] = range(1, df.shape[0] + 1)  # chronological review index
    df.sort_values(by=["card_id", "review_th"], inplace=True)
    df.drop(df[df["elapsed_days"] == 0].index, inplace=True)  # drop same-day reviews
    df["i"] = df.groupby("card_id").cumcount() + 1  # per-card review number
    df["y"] = df["rating"].map(lambda x: {1: 0, 2: 1, 3: 1, 4: 1}[x])  # recall label
    # keep only the second review of each card, i.e. the outcome of the first interval
    df = df[(df["elapsed_days"] > 0) & (df["i"] == 2)].sort_values(by=["review_th"])
    length = len(df)
    df_first_half = df.iloc[:length // 2]
    df_second_half = df.iloc[length // 2:]
    if df_first_half["rating"].nunique() == 2 and df_second_half["rating"].nunique() == 4:
        print(user_id)
    return user_id

So you need something like this: df = df.loc[(df['elapsed_seconds'] == -1) & (df['elapsed_days'] == -1)], so that you only look at first reviews.

The df["i"] == 2 plays the same role here.

@Expertium
Contributor Author

Ok, but you haven't done this

The first condition should be df_first_half["rating"].nunique()==2 or df_first_half["rating"].nunique()==3, not just df_first_half["rating"].nunique()==2, since w1 and w2 are used for interpolating S0 values both when one value is missing and when two values are missing.

@L-M-Sherlock
Member

image

@Expertium
Contributor Author

That's not what I said, though. I meant something like this:

if (df_first_half["rating"].nunique()==2 or df_first_half["rating"].nunique()==3) and df_second_half["rating"].nunique()==4:
    print(user_id)

@L-M-Sherlock
Member

image

Now we have ~300 collections.

@Expertium
Contributor Author

Nice. Now I want you to do what I described above:

  1. Using df_first_half estimate S0. Use your formula with w1 and w2 to fill in missing values.
  2. Run the optimizer on df_second_half, and use S0 values from the previous step.
  3. Do steps 1-2 for each user, record average RMSE.
  4. Change w1 and w2 and repeat steps 1-3 until you find good w1 and w2.

The key idea is that by using S0 from the first half we can check how well it fits the second half.

@L-M-Sherlock
Member

I think it's really hard to evaluate S0 with so little data:

(3 images)

@Expertium
Contributor Author

Man...
Alright, forget about it then

@Expertium
Contributor Author

Expertium commented Jan 2, 2025

@L-M-Sherlock I have a better idea

  1. Find all users who use all 4 buttons during the first review AND each button is used at least 200 times
    So if you display button counts (for the first review only) like this:
    Again: x1, Hard: x2, Good: x3, Easy: x4
    Then each x must be >=200
  2. Send me a .jsonl or .csv file with S0 values of each such user. If there are 5000 users like this, then the .jsonl file should have 5000 lines

I'll fine-tune w1 and w2 by removing 1-2 S0 values and filling them back in using the formula with w1 and w2, and then minimizing MAPE
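
For illustration, a sketch of that selection, reusing DATA_PATH and the first-review filter from earlier in this thread; export_s0() and fit_s0() are hypothetical stand-ins for whatever actually produces the four initial stabilities:

import json
import pandas as pd

def export_s0(user_ids, out_path="s0_values.jsonl"):
    with open(out_path, "w") as f:
        for user_id in user_ids:
            df = pd.read_parquet(DATA_PATH, filters=[("user_id", "==", user_id)])
            # first reviews only
            first = df[(df["elapsed_seconds"] == -1) & (df["elapsed_days"] == -1)]
            counts = first["rating"].value_counts()
            # all 4 buttons used, each at least 200 times
            if all(counts.get(r, 0) >= 200 for r in (1, 2, 3, 4)):
                s0 = fit_s0(df)  # placeholder: returns {1: S0, 2: S0, 3: S0, 4: S0}
                f.write(json.dumps({"user_id": user_id, "s0": s0}) + "\n")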

@Expertium
Contributor Author

Expertium commented Jan 4, 2025

Alright, I did it myself.
I found 2642 users who have used each button during their first reviews AND each button was pressed at least 200 times.
Then for each list of initial stabilities I removed one stability (for Again, then for Hard, then for Good, then for Easy). That gives me 2642*4=10568 datapoints. Then I also removed two stabilities (Again and Hard, Again and Good, etc.). After doing both I ended up with 26420 datapoints aka dictionaries where either one or two S0 values are missing.
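
A sketch of how those datapoints can be built from the 2642 complete S0 dictionaries (4 single-removal masks plus 6 pair-removal masks per user, 2642 × 10 = 26420); s0_dicts is assumed to be the list of complete {rating: stability} dictionaries:

from itertools import combinations

x_list, y_list = [], []  # x: dicts with 1-2 S0 values removed, y: the complete dicts
for s0 in s0_dicts:
    for k in (1, 2):  # remove one value, then two values
        for missing in combinations([1, 2, 3, 4], k):
            x_list.append({r: s for r, s in s0.items() if r not in missing})
            y_list.append(s0)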
Then I used a Bayesian optimizer (Ax) that I normally use for fine-tuning neural network hyperparameters, but it can be used with pretty much anything:

from ax.service.ax_client import AxClient

# search space for the two interpolation weights
parameters = [
    {"name": "w1", "type": "range", "bounds": [0.2, 1.8], "log_scale": False, "value_type": 'float'},
    {"name": "w2", "type": "range", "bounds": [0.2, 1.8], "log_scale": False, "value_type": 'float'}
]
ax = AxClient(random_seed=42)
ax.create_experiment(name="S0 Interpolation", parameters=parameters)
trials = 250
for i in range(trials):
    print(f"Starting trial {i + 1}/{trials}")
    parameters, trial_index = ax.get_next_trial()  # suggested {"w1": ..., "w2": ...}
    ax.complete_trial(trial_index=trial_index, raw_data=interpolate(parameters))  # report negated MAPE

interpolate() is a function that takes the trial's parameters (w1 and w2), goes over the datapoints with missing S0 values, fills them in, and calculates MAPE like this:

    def interpolate(params):
        w1, w2 = params["w1"], params["w2"]
        errors = []
        # x_list contains dictionaries with missing S0, y_list the complete ones
        for x, y in zip(x_list, y_list):
            interpolated = f_interpolate(w1, w2, x.copy())  # fill in the missing values
            for rating in [1, 2, 3, 4]:
                if rating not in x and rating in interpolated:
                    error = np.abs(y[rating] - interpolated[rating]) / y[rating]
                    errors.append(error)
        return -min(np.nanmean(errors), 1e50)

It returns -min(np.nanmean(errors), 1e50). The minus sign is because I couldn't figure out how to make Ax do minimization instead of maximization (the documentation says there is a keyword for that, but nope), and nanmean prevents the optimization from stopping if there are NaNs.

So here's what I get with the default w1=3/5, w2=3/5:
MAPE(0.60, 0.60)=86810.1%

And here's after the Bayesian optimization with w1=1.35, w2=0.68:
MAPE(1.35, 0.68)=244.2%

355 times more accurate! And yes, I checked that it doesn't result in nans.

Expertium added a commit to Expertium/srs-benchmark that referenced this issue Jan 4, 2025
@L-M-Sherlock
Member

L-M-Sherlock commented Jan 5, 2025

What if the stability of again is 1 day and the stability of hard is 2 days? Could you give me the interpolated results of good and easy? IIRC, both w1 and w2 should fall within the range of 0 to 1 to avoid weird output.

@Expertium
Contributor Author

Expertium commented Jan 5, 2025

What if the stability of again is 1 day and the stability of hard is 2 days? Could you give me the interpolated results of good and easy?

{1: 1, 2: 2, 3: 0.13801118920922661, 4: 0.03921993874306448}

Yeah, it's not monotonic. Perhaps we could make a better interpolation formula, but idk how.
Hold on, let me set the ranges to [0.1, 0.9] and see what I get via Bayesian optimization.

Ok, here's what it found
MAPE(0.41, 0.54)=375.4%
Worse than my last values, but better than the default ones.
Interpolation results: {1: 1, 2: 2, 3: 3.2375786810669958, 4: 4.879996457283174}
w1=0.41 and w2=0.54 looks good!
It's interesting that a relatively small change in w1 and w2 can reduce MAPE by more than 100 times.

L-M-Sherlock pushed a commit that referenced this issue Jan 5, 2025
* Fine-tuned w1 and w2

See #137 (comment)

* New values to avoid weirdness
@L-M-Sherlock
Member

L-M-Sherlock commented Jan 12, 2025

We have trouble... The L2 regularization doesn't work well with recency weighting:

Model: FSRS-5-recency-dev
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-recency-dev LogLoss (mean±std): 0.3256±0.1518
FSRS-5-recency-dev RMSE(bins) (mean±std): 0.0495±0.0320
FSRS-5-recency-dev AUC (mean±std): 0.7055±0.0754

Weighted average by log(reviews):
FSRS-5-recency-dev LogLoss (mean±std): 0.3508±0.1682
FSRS-5-recency-dev RMSE(bins) (mean±std): 0.0691±0.0445
FSRS-5-recency-dev AUC (mean±std): 0.7045±0.0855

Weighted average by users:
FSRS-5-recency-dev LogLoss (mean±std): 0.3541±0.1706
FSRS-5-recency-dev RMSE(bins) (mean±std): 0.0720±0.0462
FSRS-5-recency-dev AUC (mean±std): 0.7037±0.0876

parameters: [0.4226, 1.3065, 3.2448, 15.8678, 7.1831, 0.5456, 1.6395, 0.0057, 1.5183, 0.119, 1.0057, 1.9383, 0.108, 0.2961, 2.2722, 0.2305, 2.9898, 0.4832, 0.6638]

Model: FSRS-5-recency
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-recency LogLoss (mean±std): 0.3260±0.1520
FSRS-5-recency RMSE(bins) (mean±std): 0.0493±0.0322
FSRS-5-recency AUC (mean±std): 0.7041±0.0765

Weighted average by log(reviews):
FSRS-5-recency LogLoss (mean±std): 0.3519±0.1691
FSRS-5-recency RMSE(bins) (mean±std): 0.0689±0.0453
FSRS-5-recency AUC (mean±std): 0.7015±0.0879

Weighted average by users:
FSRS-5-recency LogLoss (mean±std): 0.3553±0.1719
FSRS-5-recency RMSE(bins) (mean±std): 0.0718±0.0473
FSRS-5-recency AUC (mean±std): 0.7005±0.0899

parameters: [0.4314, 1.1681, 3.2702, 15.8593, 7.1329, 0.5336, 1.7704, 0.0108, 1.5127, 0.1313, 1.004, 1.9192, 0.1034, 0.306, 2.3389, 0.2307, 3.0355, 0.4536, 0.6491]

Edit: I get it. The recency weighting decreases the average loss, so the relative penalty of L2 regularization increases. I will decrease the gamma and re-benchmark it.
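
A toy illustration of that interaction, assuming a loss of the form weighted_mean(per-review loss) + gamma · L2 penalty; the recency ramp and the numbers here are made up, only the effect is the point:

import numpy as np

rng = np.random.default_rng(0)
losses = rng.uniform(0.2, 0.5, size=10_000)   # toy per-review log losses
ramp = np.linspace(0.25, 1.0, losses.size)    # assumed recency ramp (older -> newer)
penalty = 1.0 * 0.02                          # toy gamma * ||w - w_init||^2

unweighted_total = losses.mean() + penalty
recency_total = (ramp * losses).mean() + penalty

# The data-fit term shrinks under recency weighting while the L2 term stays fixed,
# so the regularizer's relative share of the loss grows -- hence decreasing gamma.
print(penalty / unweighted_total, penalty / recency_total)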
