Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search: Add approximate string matching, a.k.a. fuzzy search #1537

Open
jacobwod opened this issue Jul 9, 2024 · 0 comments
Open

Search: Add approximate string matching, a.k.a. fuzzy search #1537

jacobwod opened this issue Jul 9, 2024 · 0 comments
Assignees
Labels
idea module:backend MapService stuff plugin:search Functionality and features of the (core) Search plugin
Milestone

Comments

@jacobwod
Copy link
Member

jacobwod commented Jul 9, 2024

I'm posting this as an idea – it's not sure that it'll be implemented like this, but I'd want to discuss it, as well as when it's a "good time" to add this functionality.

Overview

One common issue when searching in Hajk is that the user’s input must exactly match a part of the result string to generate a match. Despite using wildcards and implementing some sophisticated string splitting, the results remain unsatisfactory. In fact, achieving better results might not be feasible using the WFS protocol.

I've been experimenting with various fuzzy search methods, including trigram matching. With trigram matching, I achieved a functionality that could be useful for populating the autocomplete list, aiding users in typing correctly.

The proposed solution utilizes Postgres's pg_trgm extension, which is included in a standard installation. The DBA only needs to execute a quick CREATE EXTENSION pg_trgm in the given database.

It is also crucial for the DBA to create GiST indexes on the columns used in the search operation. My tests show that with proper indexing, a search on three columns within a table with approximately 86,000 rows can take as little as 12-15 ms.


What needs to be changed

Backend

  • The Postgres client (pg package`) is added
  • .env gains some new keys that specify the Postgres connection details (the usual stuff such as host, port, database, user etc.
  • New endpoint, controller and a service for the search functionality is added. It will use the Postgres connection, but Postgres might be needed elsewhere too, so it's probably better to put the pg client part in a service of its own.

Client

Map config is extended with a section that configures the autocomplete functionality. The things we'll need is what is to be sent from the Client to the Backend, as a JSON body. Here's a description of how I imaging this object:

{
  "pgTrgmSimilarityThreshold": 0.2, // sets the pg_trgm.similarity_threshold value, refer to Postgres' docs
  "limitPerSource": 5, // How many results do we allow, PER source?
  "totalLimit": 20, // How many results do we allow, in total?
  "sources": [ // Sources array, allows us to configure different search sources per map config
    {
      "table": "adresspunkter", // table (in the database that PG connection in .env uses
      "column": "fastighet" // name of the column within specified table - ensure to have a GiST index for it
    },
    {
      "table": "adresspunkter",
      "column": "kommundel"
    },
    {
      "table": "adresspunkter",
      "column": "beladress"
    }
  ]
}

Search is rewritten to use results from the new endpoint, either exclusively or in conjunction with the current results from the WFS and DocumentHandler sources. Currently, I focus primarily on the endpoint part, creating a flexible, easy to configure solution in the backend.


So, given the setup described above, the API can have a new endpoint that will respond with meaningful results even when user mistypes the query:

Skarminspelning.2024-07-09.kl.09.13.02.mov
@jacobwod jacobwod added module:backend MapService stuff plugin:search Functionality and features of the (core) Search plugin idea labels Jul 9, 2024
@jacobwod jacobwod added this to the 3.x milestone Jul 9, 2024
@jacobwod jacobwod self-assigned this Jul 9, 2024
jacobwod added a commit that referenced this issue Jul 12, 2024
… pg_trgm to match against map config-specified sources (see #1537 for some preliminary docs). This commit uses the (official?) 'pg' package. However, afterwards another PG client library ('porsager/postgres') caught my eye and I'm thinking of refactoring. So, this is only a quick commit to save whatever's been prototyped so far. Very much WIP.
jacobwod added a commit that referenced this issue Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
idea module:backend MapService stuff plugin:search Functionality and features of the (core) Search plugin
Projects
Status: In progress
Development

No branches or pull requests

1 participant