
12. Search Capabilities


Database Search Logic

Instead of using another technology such as Elasticsearch, we decided to use Postgres's built-in full-text search, which allows selected columns to be vectorized. The idea is that every time a row is updated, its columns are vectorized into a column of type tsvector. This vector can then be queried against using the methods described below. For simplicity's sake, we vectorized columns produced by a join of the data and stored them in the "complete" table. This is possible because there would not be many updates, inserts, or deletes in the database; if the data had been more dynamic, other options would need to be explored. Since the data is very static, the search vector is also indexed with a GiST index, which works well for a large database of static data.
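As a rough sketch of the setup (the connection URL and the column names of the "complete" table are assumptions, not the project's actual schema), the vector column and its GiST index could be created like this:

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://localhost/olympics')  # connection URL is an assumption

statements = [
    # add a tsvector column to the joined, denormalized "complete" table
    "ALTER TABLE complete ADD COLUMN search_vector tsvector",
    # vectorize the selected columns once, since the data is effectively static
    """UPDATE complete SET search_vector = to_tsvector('english',
           coalesce(athlete_name, '') || ' ' || coalesce(sport_name, '') || ' ' ||
           coalesce(event_name, '')   || ' ' || coalesce(country_name, '') || ' ' ||
           coalesce(year::text, ''))""",
    # GiST index on the search vector
    "CREATE INDEX complete_search_idx ON complete USING gist(search_vector)",
]

with engine.begin() as conn:
    for stmt in statements:
        conn.execute(text(stmt))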

Flask Integration

In order to integrate the PostgreSQL search into Flask, we had to alter the way we contact the database for the search endpoint. The change took the form of a direct database engine call, which is quite different from a normal SQLAlchemy call. With SQLAlchemy, one uses predefined methods that construct the SQL at a later time; with this alternative technique, one makes a connection to the database and passes the handwritten SQL query to an execute method.

A basic example:

database.connect().execute('SQL String')

The main reason for this deviation from the norm was the complex nature of the search query, with most of the complexity stemming from the use of PostgreSQL's built-in search capabilities, Downing's requirement that 'and' and 'or' results be returned separately, and the desired format of the results.
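For instance, the search endpoint might execute something like the following (a sketch only; the query text builds on the assumed "complete" table and search_vector column above and is not the project's exact SQL):

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://localhost/olympics')
with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT * FROM complete "
             "WHERE search_vector @@ to_tsquery('english', :terms)"),
        {'terms': 'charles & aquatics'}
    ).fetchall()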

Python Magic

Preprocessing

According to the requirements, the results must include both 'and' and 'or' results separately for the same search. Thus, before the SQL call is made, the search terms must be parsed.

On a basic level, either an '&' symbol or an '|' symbol must be inserted between each pair of terms, i.e. "charles aquatics" becomes "charles & aquatics" and "charles | aquatics".

This parsing is achieved by splitting the string on any number of spaces and removing any empty strings left behind by rogue spaces. Afterwards, it is easy enough to join each element of the newly created list of terms with " & " and " | ".
The code excerpt below shows how to do it in Python.

import re

term_list = list(filter(lambda x: x != '', re.split('[ ]+', search_terms)))  # split on runs of spaces, drop empties
or_search  = ' | '.join(term_list)
and_search = ' & '.join(term_list)
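For the "charles aquatics" example above, this produces:

>>> term_list = list(filter(lambda x: x != '', re.split('[ ]+', 'charles aquatics')))
>>> ' | '.join(term_list)
'charles | aquatics'
>>> ' & '.join(term_list)
'charles & aquatics'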

Defining Results Schema

For the search results page one needs the following information:

  • Category the result entry falls under
    • Athlete/Sport/Event/Year/Country
  • The result entry that matched the search terms
  • The terms that the result entry matched
  • Some text defining why that particular result entry matched the search terms

For instance, the results for a search for "Aquatics" may look as follows:

AND
Athletes

  Charles Buchanan Hickcox
    matched: "Aquatics"
    Charles Buchanan Hickcox (United States) was in 100m backstroke (Aquatics) in Mexico 1968

  Ronald P. Mills
    matched: "Aquatics"
    Ronald P. Mills (United States) was in 100m backstroke (Aquatics) in Mexico 1968

  Roland Matthes
    matched: "Aquatics"
    Roland Matthes (East Germany) was in 100m backstroke (Aquatics) in Mexico 1968

Years

  1968
    matched: "Aquatics"
Charles Buchanan Hickcox (United States) was in 100m backstroke (Aquatics) in Mexico 1968
    Ronald P. Mills (United States) was in 100m backstroke (Aquatics) in Mexico 1968
    Roland Matthes (East Germany) was in 100m backstroke (Aquatics) in Mexico 1968

etc. for the other categories.
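In Python, one way to hold this information (the exact keys and nesting here are an assumption, not the project's actual structure) is a dictionary keyed first by boolean operator and then by category:

results = {
    'and': {
        'Athletes': [
            {'name': 'Charles Buchanan Hickcox',
             'matched': ['Aquatics'],
             'reasons': ['Charles Buchanan Hickcox (United States) was in '
                         '100m backstroke (Aquatics) in Mexico 1968']},
            # ... more athlete entries ...
        ],
        'Years': [
            {'name': '1968',
             'matched': ['Aquatics'],
             'reasons': []},  # one reason line per matching row, as shown above
        ],
        # ... Sports, Events, Countries ...
    },
    'or': {
        # same shape as the 'and' half
    },
}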

Implementation

After calling the search execution method, we are given a table of data to be parsed into the specified format, with each row containing a datum for each category. Since this is a lot of data, we have the database do some of the work by specifying which boolean operator each row belongs to and by bundling related values into arrays within each row,
e.g. ['and', [athlete_name, athlete_id], [sport_name, sport_id], ... ].

This bundling allows each row to be zipped with a predefined list of strings representing the category names to be displayed in the search results. The zipping lets the algorithm know which array should be displayed under which category,
i.e. [('Athlete', [athlete_name, athlete_id]), ('Sport', [sport_name, sport_id]), ... ] denotes that each array belongs to that category.

Since PostgreSQL's search will put bold tags around strings that matched the search terms, we can derive which terms matched each row of the results by looking for the bold tags while parsing each row.

Lastly, each entry for each category needs to remember all the data in its individual row for the purpose of displaying why that entry showed up in the results. This is necessary mostly for the seemingly odd entries,
e.g. a search for "Aquatics" may return a year, because one or more athletes won a medal in aquatics in that year.
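As a small illustration of the bold-tag matching (the highlighted string below is made up; PostgreSQL's default highlight markers are <b> and </b>), the matched terms can be pulled back out with a regular expression:

import re

def matched_terms(highlighted):
    # collect every term PostgreSQL wrapped in bold tags
    return re.findall(r'<b>(.*?)</b>', highlighted)

matched_terms('Charles Buchanan Hickcox was in 100m backstroke (<b>Aquatics</b>)')
# -> ['Aquatics']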

With these things in mind, here is the process:

  1. Loop through all the results
  2. Zip the categories with the row
  3. Find out whether the row is an "and" or an "or" result
  4. Save off all the data from the row
  5. Make an empty list to hold all the matched terms
  6. Loop through all columns/categories of the row
  7. Save off any column/term that satisfied the search to the matched terms list
  8. Add every entry/column to its denoted category and appropriate boolean category as a dictionary, including a copy of the terms matched and the row data
  9. Upon finding a duplicate entry for a particular category, just append the row's data to the entry's other row data lists, giving it another reason as to why it was included
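
A rough sketch of that loop, assuming the row bundling and category names described above (the function, variable, and key names here are hypothetical, not the project's actual code):

import re
from collections import defaultdict

CATEGORIES = ['Athlete', 'Sport', 'Event', 'Year', 'Country']

def parse_results(rows):
    # results['and']['Athlete'] maps an entry id to its accumulated data
    results = {'and': defaultdict(dict), 'or': defaultdict(dict)}
    for row in rows:
        operator, columns = row[0], row[1:]        # 'and'/'or', then the bundled [value, id] arrays
        labelled = list(zip(CATEGORIES, columns))  # [('Athlete', [name, id]), ('Sport', [name, id]), ...]
        row_data = [value for _, (value, _id) in labelled]
        # any column wrapped in bold tags satisfied the search
        matched = []
        for _, (value, _id) in labelled:
            matched += re.findall(r'<b>(.*?)</b>', str(value))
        for category, (value, entry_id) in labelled:
            entry = results[operator][category].setdefault(
                entry_id,
                {'name': re.sub(r'</?b>', '', str(value)),  # display name without the tags
                 'matched': set(), 'reasons': []})
            entry['matched'].update(matched)
            # a duplicate entry simply gains another row of data explaining why it appeared
            entry['reasons'].append(row_data)
    return results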