AMP FTS experiment logic #6558

bendk · 2025-01-14T22:57:40Z

This sets it up so we always ingest the FTS data and use the SuggestionProviderConstraints passed to the query to determine how to perform the query. This seems like the simplest approach and it doesn't increase ingestion time that much. The benchmarks on my machine went from 339.29 ms to 465.60 ms.

Pull Request checklist

Breaking changes: This PR follows our breaking change policy
- This PR follows the breaking change policy:
  - This PR has no breaking API changes, or
  - There are corresponding PRs for our consumer applications that resolve the breaking changes and have been approved
Quality: This PR builds and tests run cleanly
- Note:
  - For changes that need extra cross-platform testing, consider adding [ci full] to the PR title.
  - If this pull request includes a breaking change, consider cutting a new release after merging.
Tests: This PR includes thorough tests or an explanation of why it does not
Changelog: This PR includes a changelog entry in CHANGELOG.md or an explanation of why it does not need one
- Any breaking changes to Swift or Kotlin binding APIs are noted explicitly
Dependencies: This PR follows our dependency management guidelines
- Any new dependencies are accompanied by a summary of the due diligence applied in selecting them.

Branch builds: add [firefox-android: branch-name] to the PR title.

0c0w3

Thanks, I think there are some open questions but I'll go ahead and r+ to unblock you. Also, would you be able to add any tests?

components/suggest/src/db.rs

0c0w3 · 2025-01-15T02:39:20Z

components/suggest/src/db.rs

+                // We don't have a good way to determine the full keyword, let's just leave that
+                // blank for now.


You could join the full_keywords table on the suggestion ID, although there's no index on full_keywords.suggestion_id so it would probably be slow.

Desktop shows the full keyword for AMP suggestions, and I think for the different experiment branches to be comparable their UIs need to be the same. So I do think we probably need some kind of full keyword. For now, it might be enough to cheat and just show whatever's handy, like the search string itself, and in that case I think the suggest component should be the one to determine that (by setting Suggestion::Amp::full_keyword below). Wdyt? It would be fine to handle that in a follow-up.

That behavior sounds right to me. I thought that desktop would use the search string when we sent back a blank string, but in hindsight I'm not sure why I would think that.

I don't think we can do better without investing a lot more into FTS. We can get a full keyword string, but how do we know it's the right one? In the case if title matching, there might not even be a full keyword to display.

0c0w3 · 2025-01-15T02:42:35Z

components/suggest/src/db.rs

+                // We don't have a good way to determine the full keyword, let's just leave that
+                // blank for now.
+
+                self.conn.query_row_and_then(


This query and its row callback are the same as the ones in fetch_amp_suggestions_using_keywords(), right? Would it be worth factoring it out into a helper you can call from both places?

I tried factoring this out, but there's actually some differences because the keywords are different. Let's leave the duplicate code in for now, if we end up supporting both exact keywords and AMP, we can refactor later.

components/suggest/src/provider.rs

components/suggest/src/db.rs

0c0w3 · 2025-01-15T03:08:03Z

components/suggest/src/db.rs

+                  s.provider = :provider
+                  AND amp_fts match '{fts_column}: {keyword_lowercased}'
+                AND NOT EXISTS (SELECT 1 FROM dismissed_suggestions WHERE url=s.url)
+                "#


I think we'll need an ORDER BY and LIMIT 1 for the FTS query. For non-FTS, we get away without them because AMP provides unique keywords, but an FTS query may match many suggestion rows -- at least when matching by title, I'm not as sure about full keywords, but that seems possible too.

The obvious criteria to order by is the FTS rank, but I think we need to be careful since that boils down to bm25, which is a function of term frequency. That's not very applicable here, at least when matching by full keyword since you're generating the full_keyword column in the fts table by concating all full keywords for a suggestion (if I'm reading that right). So we'll end up with a full keyword like "amazon amazon fresh amazon login amazon prime login..." for the first suggestion in data-0000-0200 for example, which will rank highly for an "amazon" search. But maybe that's actually fair? I'm not sure.

For now, maybe we could just order by suggestion ID to get a stable sort and hope that works OK. I'm not sure if there's a good answer.

Great point. I'm thinking we can rank and de-dupe the keywords before inserting them. I'm not sure if that will work well or not, but it seems like it's worth a shot.

components/suggest/src/rs.rs

0c0w3 · 2025-01-15T03:15:31Z

components/suggest/src/db.rs

-              AND k.keyword = :keyword
-            AND NOT EXISTS (SELECT 1 FROM dismissed_suggestions WHERE url=s.url)
-            "#,
+            if allow_keyword_expansion {


Is "keyword expansion" a term that's being used on the team for this? It seems to mean that the keyword doesn't need to be a substring of the full keyword, is that right?

I heard that term from Nan. My understanding is that there's a phase where they add keywords for related terms and misspellings and we can detect those keywords by the substring test.

bendk · 2025-01-15T20:12:21Z

Thanks for the review, I made most of the suggested changes and left notes on the others. Do you want to review again?

0c0w3 · 2025-01-15T20:40:02Z

components/suggest/src/provider.rs

+    /// misspellings like "underarmor" instead of "under armor"'.
+    NoKeywordExpansion,
+    /// Use FTS matching against the full keywords, joined together.
+    Fts,


How about FtsAgainstFullKeywords or maybe just FtsFullKeywords? I like FtsAgainstTitle (although I might suggest simply FtsTitle) for its clarity.

0c0w3 · 2025-01-15T20:41:53Z

components/suggest/src/provider.rs

+    /// ongoing?
+    /// `0` means the treatment branch.
+    #[uniffi(default = None)]
+    pub amp_alternative_matching: Option<AmpMatchingStrategy>,


Nice! 👍 Could you update the doc comment please? (It still mentions zero)

Nit: Sorry, in the same spirit of renaming this struct member, would you mind rewording the doc comment so it doesn't talk about experiments? And now that it's an enum, I think it would be fine for the comment to only mention what None means (i.e., the usual exact keyword matching), if you'd like.

0c0w3 · 2025-01-15T21:00:53Z

components/suggest/src/rs.rs

+        let parts: HashSet<_> = self.full_keywords.iter().map(|(s, _)| s.as_str()).collect();
+        let mut result = String::new();
+        for (i, part) in parts.into_iter().enumerate() {
+            if i == 0 {


This should be i > 0, right? And IIUC you're using a hash set to discard duplicate full keywords, but that's not the problem, and I'd be surprised if AMP included duplicate full keywords for one suggestion. The problem is that many of a suggestion's full keywords can contain the same substring, e.g. "amazon". So you'd need to split_whitespace() on each full keyword string and then collect each resulting word into a hash set. That should work.

You know, it might be worth preprocessing this and including the joined full keywords string in the data itself. I bet it wouldn't be too hard to modify our AMP upload script, and it probably wouldn't increase the dataset size by too much.

Ah my i > 0 comment is outdated, thanks.

0c0w3 · 2025-01-15T21:19:37Z

components/suggest/src/db.rs

+            fts_insert.execute(
+                suggestion_id,
+                &common_details.full_keywords_joined(),
+                &common_details.title,
+            )?;


Sorry, just thought of something: Should this be gated on provider constraints in the ingest constraints? Most people won't be in the experiment, and ideally they wouldn't pay the cost of ingesting into the FTS table. Desktop can easily pass in some ingest constraints for users in the experiment. Maybe it's not a big deal though.

My concern with that is that we won't re-ingest records if they've already been ingested, and I couldn't figure out a good way to handle this in should_reprocess_record with the data we currently have.

It does add some time to ingestion, but not all that much. I put some metrics in the title.

Maybe we could store the provider constraints in a column in ingested_records, but that seems more like a long-term improvement to me.

Seems doable: If the constraints are FtsAgainstFullKeywords or FtsAgainstTitle and the FTS table is empty, then reprocess? I'm assuming it's simple to tell whether the table is empty like it is for non-virtual/non-FTS tables. I don't want to block on that though, and thanks for doing the benchmarks.

Maybe we could store the provider constraints in a column in ingested_records, but that seems more like a long-term improvement to me.

I had that thought too when I added should_reprocess_record. That would be a general, sure-fire way of reingesting, so it might be worth pursuing.

This sets it up so we always ingest the FTS data and use the `SuggestionProviderConstraints` passed to the query to determine how to perform the query. This seems like the simplest approach and it doesn't increase ingestion time that much. The benchmarks on my machine went from 339.29 ms to 465.60 ms.

0c0w3

Thanks, lgtm!

0c0w3 · 2025-01-16T00:19:12Z

components/suggest/src/db.rs

+            fts_insert.execute(
+                suggestion_id,
+                &common_details.full_keywords_joined(),
+                &common_details.title,
+            )?;


Seems doable: If the constraints are FtsAgainstFullKeywords or FtsAgainstTitle and the FTS table is empty, then reprocess? I'm assuming it's simple to tell whether the table is empty like it is for non-virtual/non-FTS tables. I don't want to block on that though, and thanks for doing the benchmarks.

Maybe we could store the provider constraints in a column in ingested_records, but that seems more like a long-term improvement to me.

I had that thought too when I added should_reprocess_record. That would be a general, sure-fire way of reingesting, so it might be worth pursuing.

0c0w3 · 2025-01-16T00:20:55Z

components/suggest/src/provider.rs

+    /// ongoing?
+    /// `0` means the treatment branch.
+    #[uniffi(default = None)]
+    pub amp_alternative_matching: Option<AmpMatchingStrategy>,


Nit: Sorry, in the same spirit of renaming this struct member, would you mind rewording the doc comment so it doesn't talk about experiments? And now that it's an enum, I think it would be fine for the comment to only mention what None means (i.e., the usual exact keyword matching), if you'd like.

0c0w3 · 2025-01-16T00:23:36Z

components/suggest/src/store.rs

+    }
+
+    #[test]
+    fn amp_fts() -> anyhow::Result<()> {


Naming nit: amp_fts_against_full_keywords

bendk requested a review from 0c0w3 January 14, 2025 22:57

bendk force-pushed the push-ousuoonkusqz branch from f0276a2 to 3921dc4 Compare January 14, 2025 22:59

0c0w3 approved these changes Jan 15, 2025

View reviewed changes

bendk force-pushed the push-ousuoonkusqz branch from 3921dc4 to e9f0b48 Compare January 15, 2025 20:11

bendk force-pushed the push-ousuoonkusqz branch from e9f0b48 to 381bae6 Compare January 15, 2025 21:05

0c0w3 reviewed Jan 15, 2025

View reviewed changes

bendk force-pushed the push-ousuoonkusqz branch from 381bae6 to d0223d5 Compare January 15, 2025 21:41

bendk force-pushed the push-ousuoonkusqz branch from d0223d5 to 1284a0d Compare January 15, 2025 21:46

0c0w3 approved these changes Jan 16, 2025

View reviewed changes

0c0w3 reviewed Jan 16, 2025

View reviewed changes

components/suggest/src/store.rs

}

#[test]

fn amp_fts() -> anyhow::Result<()> {

Copy link

Contributor

0c0w3 Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming nit: amp_fts_against_full_keywords

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AMP FTS experiment logic #6558

AMP FTS experiment logic #6558

bendk commented Jan 14, 2025 •

edited

Loading

0c0w3 left a comment

0c0w3 Jan 15, 2025

bendk Jan 15, 2025

0c0w3 Jan 15, 2025

bendk Jan 15, 2025

0c0w3 Jan 15, 2025

bendk Jan 15, 2025

0c0w3 Jan 15, 2025

bendk Jan 15, 2025

bendk commented Jan 15, 2025

0c0w3 Jan 15, 2025

0c0w3 Jan 15, 2025

0c0w3 Jan 16, 2025

0c0w3 Jan 15, 2025

0c0w3 Jan 15, 2025

0c0w3 Jan 15, 2025

bendk Jan 15, 2025

0c0w3 Jan 16, 2025

0c0w3 left a comment

0c0w3 Jan 16, 2025

0c0w3 Jan 16, 2025

0c0w3 Jan 16, 2025

		// We don't have a good way to determine the full keyword, let's just leave that
		// blank for now.

AMP FTS experiment logic #6558

Are you sure you want to change the base?

AMP FTS experiment logic #6558

Conversation

bendk commented Jan 14, 2025 • edited Loading

Pull Request checklist

0c0w3 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bendk commented Jan 15, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

0c0w3 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bendk commented Jan 14, 2025 •

edited

Loading