Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMP FTS experiment logic #6558

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

bendk
Copy link
Contributor

@bendk bendk commented Jan 14, 2025

This sets it up so we always ingest the FTS data and use the SuggestionProviderConstraints passed to the query to determine how to perform the query. This seems like the simplest approach and it doesn't increase ingestion time that much. The benchmarks on my machine went from 339.29 ms to 465.60 ms.

Pull Request checklist

  • Breaking changes: This PR follows our breaking change policy
    • This PR follows the breaking change policy:
      • This PR has no breaking API changes, or
      • There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
  • Quality: This PR builds and tests run cleanly
    • Note:
      • For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
      • If this pull request includes a breaking change, consider cutting a new release after merging.
  • Tests: This PR includes thorough tests or an explanation of why it does not
  • Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
    • Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
  • Dependencies: This PR follows our dependency management guidelines
    • Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

@bendk bendk requested a review from 0c0w3 January 14, 2025 22:57
@bendk bendk force-pushed the push-ousuoonkusqz branch from f0276a2 to 3921dc4 Compare January 14, 2025 22:59
Copy link
Contributor

@0c0w3 0c0w3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think there are some open questions but I'll go ahead and r+ to unblock you. Also, would you be able to add any tests?

components/suggest/src/db.rs Outdated Show resolved Hide resolved
Comment on lines 520 to 521
// We don't have a good way to determine the full keyword, let's just leave that
// blank for now.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could join the full_keywords table on the suggestion ID, although there's no index on full_keywords.suggestion_id so it would probably be slow.

Desktop shows the full keyword for AMP suggestions, and I think for the different experiment branches to be comparable their UIs need to be the same. So I do think we probably need some kind of full keyword. For now, it might be enough to cheat and just show whatever's handy, like the search string itself, and in that case I think the suggest component should be the one to determine that (by setting Suggestion::Amp::full_keyword below). Wdyt? It would be fine to handle that in a follow-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That behavior sounds right to me. I thought that desktop would use the search string when we sent back a blank string, but in hindsight I'm not sure why I would think that.

I don't think we can do better without investing a lot more into FTS. We can get a full keyword string, but how do we know it's the right one? In the case if title matching, there might not even be a full keyword to display.

// We don't have a good way to determine the full keyword, let's just leave that
// blank for now.

self.conn.query_row_and_then(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query and its row callback are the same as the ones in fetch_amp_suggestions_using_keywords(), right? Would it be worth factoring it out into a helper you can call from both places?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried factoring this out, but there's actually some differences because the keywords are different. Let's leave the duplicate code in for now, if we end up supporting both exact keywords and AMP, we can refactor later.

components/suggest/src/provider.rs Outdated Show resolved Hide resolved
components/suggest/src/db.rs Outdated Show resolved Hide resolved
s.provider = :provider
AND amp_fts match '{fts_column}: {keyword_lowercased}'
AND NOT EXISTS (SELECT 1 FROM dismissed_suggestions WHERE url=s.url)
"#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll need an ORDER BY and LIMIT 1 for the FTS query. For non-FTS, we get away without them because AMP provides unique keywords, but an FTS query may match many suggestion rows -- at least when matching by title, I'm not as sure about full keywords, but that seems possible too.

The obvious criteria to order by is the FTS rank, but I think we need to be careful since that boils down to bm25, which is a function of term frequency. That's not very applicable here, at least when matching by full keyword since you're generating the full_keyword column in the fts table by concating all full keywords for a suggestion (if I'm reading that right). So we'll end up with a full keyword like "amazon amazon fresh amazon login amazon prime login..." for the first suggestion in data-0000-0200 for example, which will rank highly for an "amazon" search. But maybe that's actually fair? I'm not sure.

For now, maybe we could just order by suggestion ID to get a stable sort and hope that works OK. I'm not sure if there's a good answer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point. I'm thinking we can rank and de-dupe the keywords before inserting them. I'm not sure if that will work well or not, but it seems like it's worth a shot.

components/suggest/src/rs.rs Outdated Show resolved Hide resolved
AND k.keyword = :keyword
AND NOT EXISTS (SELECT 1 FROM dismissed_suggestions WHERE url=s.url)
"#,
if allow_keyword_expansion {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "keyword expansion" a term that's being used on the team for this? It seems to mean that the keyword doesn't need to be a substring of the full keyword, is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I heard that term from Nan. My understanding is that there's a phase where they add keywords for related terms and misspellings and we can detect those keywords by the substring test.

@bendk bendk force-pushed the push-ousuoonkusqz branch from 3921dc4 to e9f0b48 Compare January 15, 2025 20:11
@bendk
Copy link
Contributor Author

bendk commented Jan 15, 2025

Thanks for the review, I made most of the suggested changes and left notes on the others. Do you want to review again?

@bendk bendk force-pushed the push-ousuoonkusqz branch from e9f0b48 to 381bae6 Compare January 15, 2025 21:05
/// misspellings like "underarmor" instead of "under armor"'.
NoKeywordExpansion,
/// Use FTS matching against the full keywords, joined together.
Fts,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about FtsAgainstFullKeywords or maybe just FtsFullKeywords? I like FtsAgainstTitle (although I might suggest simply FtsTitle) for its clarity.

/// ongoing?
/// `0` means the treatment branch.
#[uniffi(default = None)]
pub amp_alternative_matching: Option<AmpMatchingStrategy>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👍 Could you update the doc comment please? (It still mentions zero)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Sorry, in the same spirit of renaming this struct member, would you mind rewording the doc comment so it doesn't talk about experiments? And now that it's an enum, I think it would be fine for the comment to only mention what None means (i.e., the usual exact keyword matching), if you'd like.

let parts: HashSet<_> = self.full_keywords.iter().map(|(s, _)| s.as_str()).collect();
let mut result = String::new();
for (i, part) in parts.into_iter().enumerate() {
if i == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be i > 0, right? And IIUC you're using a hash set to discard duplicate full keywords, but that's not the problem, and I'd be surprised if AMP included duplicate full keywords for one suggestion. The problem is that many of a suggestion's full keywords can contain the same substring, e.g. "amazon". So you'd need to split_whitespace() on each full keyword string and then collect each resulting word into a hash set. That should work.

You know, it might be worth preprocessing this and including the joined full keywords string in the data itself. I bet it wouldn't be too hard to modify our AMP upload script, and it probably wouldn't increase the dataset size by too much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my i > 0 comment is outdated, thanks.

Comment on lines +1100 to +1104
fts_insert.execute(
suggestion_id,
&common_details.full_keywords_joined(),
&common_details.title,
)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just thought of something: Should this be gated on provider constraints in the ingest constraints? Most people won't be in the experiment, and ideally they wouldn't pay the cost of ingesting into the FTS table. Desktop can easily pass in some ingest constraints for users in the experiment. Maybe it's not a big deal though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern with that is that we won't re-ingest records if they've already been ingested, and I couldn't figure out a good way to handle this in should_reprocess_record with the data we currently have.

It does add some time to ingestion, but not all that much. I put some metrics in the title.

Maybe we could store the provider constraints in a column in ingested_records, but that seems more like a long-term improvement to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems doable: If the constraints are FtsAgainstFullKeywords or FtsAgainstTitle and the FTS table is empty, then reprocess? I'm assuming it's simple to tell whether the table is empty like it is for non-virtual/non-FTS tables. I don't want to block on that though, and thanks for doing the benchmarks.

Maybe we could store the provider constraints in a column in ingested_records, but that seems more like a long-term improvement to me.

I had that thought too when I added should_reprocess_record. That would be a general, sure-fire way of reingesting, so it might be worth pursuing.

@bendk bendk force-pushed the push-ousuoonkusqz branch from 381bae6 to d0223d5 Compare January 15, 2025 21:41
This sets it up so we always ingest the FTS data and use the
`SuggestionProviderConstraints` passed to the query to determine how to
perform the query.  This seems like the simplest approach and it doesn't
increase ingestion time that much.  The benchmarks on my machine went
from 339.29 ms to 465.60 ms.
@bendk bendk force-pushed the push-ousuoonkusqz branch from d0223d5 to 1284a0d Compare January 15, 2025 21:46
Copy link
Contributor

@0c0w3 0c0w3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, lgtm!

Comment on lines +1100 to +1104
fts_insert.execute(
suggestion_id,
&common_details.full_keywords_joined(),
&common_details.title,
)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems doable: If the constraints are FtsAgainstFullKeywords or FtsAgainstTitle and the FTS table is empty, then reprocess? I'm assuming it's simple to tell whether the table is empty like it is for non-virtual/non-FTS tables. I don't want to block on that though, and thanks for doing the benchmarks.

Maybe we could store the provider constraints in a column in ingested_records, but that seems more like a long-term improvement to me.

I had that thought too when I added should_reprocess_record. That would be a general, sure-fire way of reingesting, so it might be worth pursuing.

/// ongoing?
/// `0` means the treatment branch.
#[uniffi(default = None)]
pub amp_alternative_matching: Option<AmpMatchingStrategy>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Sorry, in the same spirit of renaming this struct member, would you mind rewording the doc comment so it doesn't talk about experiments? And now that it's an enum, I think it would be fine for the comment to only mention what None means (i.e., the usual exact keyword matching), if you'd like.

}

#[test]
fn amp_fts() -> anyhow::Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nit: amp_fts_against_full_keywords

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants