Using PostgreSQL’s search capabilities (tsvector etc.) is a well-known low-key alternative to a full-blown search engine like Elasticsearch/Solr/Algolia. But it turns out it is also quite usable as a “reverse search engine”, what Elasticsearch calls “Percolate”. One use case is user-supplied “subscriptions”: think of search subscriptions for classifieds that should trigger whenever a new item reaches the database.
After fiddling around with Elasticsearch percolators one day, I was not satisfied with the results, so I decided to try PG’s tsvector/tsquery.
Recently, I wanted to build a keyword tagger that, given a blob of text, uses a predefined vocabulary database to give you the “most relevant” tags for that text. The process is exactly the same as with search subscriptions, with query and document reversed: the queries are stored in the database, and there is only one document, the one to match against the queries. For this example I will use keyword matching as we implemented it in Empfehlungsbund in various places:
- We have a table keywords with a string column keyword, e.g. a managed table of the most important keywords for your domain. In our case that includes all the technology terms of developers/administrators (like “Javascript” or “Ruby on Rails”). There is also a little hierarchy of keywords in “competence groups”, and we also store a “search relevance” which expresses how often a term is used by job seekers on the Empfehlungsbund job platforms.
1. Adding TS-columns
- We add a tsquery column keyword_query and a normalized token column keyword_search_token:
CREATE TABLE public.keywords (
  id bigint NOT NULL,
  keyword character varying,
  keyword_search_token character varying,
  keyword_query tsquery
);
keyword_query: of the several guides about PG’s full-text search that I have followed, none proposed just storing a tsquery instead of a tsvector. But in our case (percolate) it makes total sense to store the string as a tsquery.
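To make the “reverse” idea concrete, here is a minimal, self-contained illustration. The ‘simple’ configuration is used only so that stop words and stemming don’t get in the way of the example; it is not necessarily what you would use in production:

-- the stored query (what will later live in keyword_query),
-- matched against an ad-hoc tsvector built from the document:
SELECT to_tsvector('simple', 'We need a Ruby on Rails developer')
       @@ to_tsquery('simple', 'ruby <-> on <-> rails');
-- => true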
keyword_search_token is a normalized version of our query, produced by a processing pipeline that we run before saving a record AND before tagging a document against the stored queries later on. In our case the pipeline consists of:
- remove all HTML,
- normalize collocations (“java-developer” and “java developer”),
- remove common (German) stop words,
- replace important special chars that our search engine needs to keep, like “.” (‘.NET’, “Vue.js”) or prefix/suffix ‘#’, ‘+’ (C++, C#). Many search engines do not get this right and strip out all special chars by default.
Here is a Ruby snippet from our codebase:
# list of (German) stop words maintained in the app
STOPWORDS = YAML.load_file('config/stopwords.yml')
# a stop word only counts when it stands on its own (start/end of string, whitespace or punctuation around it)
WORD_BOUNDARY = '(^|\s|[,\.\-\!\?])'.freeze
REPLACE_REGEX = /#{WORD_BOUNDARY}(#{STOPWORDS.join('|')})#{WORD_BOUNDARY}/i.freeze

def preprocess_query(text)
  ActionController::Base.helpers.strip_tags(text).    # remove all HTML
    tr('-', ' ').                                     # "java-developer" => "java developer"
    yield_self { |s| replace_special_chars(s) }.      # keep '#', '+', '.' as searchable tokens
    gsub(REPLACE_REGEX, ' ').                         # drop stop words
    gsub(%r{\([mwd/ ]+\)}, '').                       # drop German gender markers like "(m/w/d)"
    gsub(/\s+/, ' ').                                 # collapse whitespace
    strip
end

# "C#" => "CRAUTE", "C++" => "CPLUSPLUS", "Vue.js" => "VueDOTjs",
# so these tokens survive to_tsvector/to_tsquery instead of being stripped
def replace_special_chars(string)
  string.gsub(/[#\+\.]/, '#' => 'RAUTE', '+' => 'PLUS', '.' => 'DOT')
end
2. Adding a tsquery column and trigger
Now we want to build a trigger that manages the keyword_query conversion automatically. We can do this inline by splitting the token string on whitespace and joining the words back together with the infix “<->” operator (“followed by”, introduced in PG 9.6), which requires the word on the right to appear directly after the word on the left. This is extremely useful for matching collocations like “Ruby on Rails”.
CREATE FUNCTION public.keywords_before_insert_update_row_tr() RETURNS trigger
    LANGUAGE plpgsql
    AS $$
BEGIN
    new.keyword_query := to_tsquery(array_to_string(string_to_array(new.keyword_search_token, ' '), ' <-> '));
    RETURN NEW;
END;
$$;

CREATE TRIGGER keywords_before_insert_update_row_tr BEFORE INSERT OR UPDATE ON public.keywords FOR EACH ROW EXECUTE PROCEDURE public.keywords_before_insert_update_row_tr();
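To see the trigger in action, here is a hypothetical smoke test. The values are made up, and the exact form of the generated tsquery depends on your default text search configuration:

INSERT INTO keywords (id, keyword, keyword_search_token)
VALUES (1, 'Ruby on Rails', 'ruby on rails');

SELECT keyword, keyword_query FROM keywords WHERE id = 1;
-- keyword_query has been filled by the trigger with a phrase query
-- built from the token string, roughly 'ruby' <-> 'on' <-> 'rails'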
2a. Rails tsquery migration
If you are using Rails, you can use the hairtrigger gem to create the trigger more easily in a migration:
add_column :keywords, :keyword_search_token, :string
add_column :keywords, :keyword_query, :tsquery
add_index :keywords, :keyword_query, using: 'gist'

create_trigger(compatibility: 1).on(:keywords).before(:insert, :update) do
  "new.keyword_query := to_tsquery(array_to_string(string_to_array(new.keyword_search_token, ' '), ' <-> '))"
end
BUT: ActiveRecord’s PG adapter currently does not know how to handle tsquery columns and will fail with:
unknown OID 3615: failed to recognize type of 'keyword_query'. It will be treated as String.
And the table will be missing from db/schema.rb, and thus not loaded in tests, with this comment:
# Could not dump table "keywords" because of following StandardError
# Unknown type 'tsquery' for column 'keyword_query'
The usual solution you find for this class of problem is to switch the schema dumper to “SQL” mode. I don’t like that, because it makes the diffs gigantic and merges/rebases fail more often. Instead, I found I could just tell Rails how to handle the type by patching it:
# config/initializers/active_record_pg_types.rb
require "active_record/connection_adapters/postgresql_adapter"
module PGAddTypes
  # 1st place: how to handle this column = it's a string!
  def initialize_type_map(m = type_map)
    super
    m.register_type "tsquery", ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::OID::SpecializedString.new(:tsquery)
    m.register_type(3615, ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::OID::SpecializedString.new(:tsquery))
  end
end

# 2nd place:
ActiveRecord::ConnectionAdapters::PostgreSQLAdapter::NATIVE_DATABASE_TYPES[:tsquery] = { name: "tsquery" }
ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend PGAddTypes

# 3rd place: mapping a schema / migration line to the database type
module ActiveRecord::ConnectionAdapters::PostgreSQL::ColumnMethods
  def tsquery(*args, **options)
    args.each { |name| column(name, :tsquery, **options) }
  end
end
With this code in place, schema-dumping and loading will work again!
3. Reverse searching against the table
preprocessed_query = preprocess_query(params[:query])

# create an intermediate FROM clause with our keywords table and the preprocessed full text as an ad-hoc tsvector:
from = Keyword.sanitize_sql_array ['keywords, to_tsvector(?) full_text', preprocessed_query]

Keyword.
  # where: having a match between the document and the stored query
  where(%{full_text @@ keyword_query}).
  from(from).
  order('score desc').
  # select: all normal columns + a "score"
  select("keywords.*, ts_rank(full_text, keyword_query) as score")
In SQL:
SELECT keywords.*, ts_rank(full_text, keyword_query) as score
FROM keywords, to_tsvector('Tag this string with keywords like Ruby on Rails') full_text
WHERE (full_text @@ keyword_query)
ORDER BY score desc
(Of course, you should NOT pass the raw string in like this, but use a prepared statement or bind parameter, as sketched below.)
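If you run the query outside of ActiveRecord, here is one sketch of doing exactly that with a server-side prepared statement; the statement name tag_document is made up for illustration:

PREPARE tag_document(text) AS
  SELECT keywords.*, ts_rank(full_text, keyword_query) AS score
  FROM keywords, to_tsvector($1) full_text
  WHERE full_text @@ keyword_query
  ORDER BY score DESC;

EXECUTE tag_document('Tag this string with keywords like Ruby on Rails');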
Tips
Improving the accuracy by scoping
If you have a large spectrum of keywords, there is the problem that some words mean different things for different audiences. For example, when parsing a job ad or an applicant’s CV, imagine you find the word “Amazon”: for an IT person that is probably related to “AWS”, for retail staff it is more likely about Amazon ads or shops. Or think of generic Greek/Latin names for all kinds of technologies (“Atlas”, “Prometheus” etc.), which can carry all kinds of meanings in different contexts, or roles like “Architect” or “Designer”. In our case we add the general sector to a keyword, like “IT” (programmers, architects, admins, tech designers), “Engineering”, “Medicine”, “white-collar office” (Marketing, Sales, HR, Accounting) and similar. In a first step we find which sector has the most direct keyword matches, and then we only search for the keywords of that sector, as sketched below.
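A rough SQL sketch of that two-pass approach. It assumes a sector column on the keywords table and a $1 placeholder for the preprocessed document text; both are illustrative, not our exact schema:

-- pass 1: which sector gets the most direct matches?
SELECT sector, count(*) AS hits
FROM keywords, to_tsvector($1) full_text
WHERE full_text @@ keyword_query
GROUP BY sector
ORDER BY hits DESC
LIMIT 1;

-- pass 2: tag only with keywords of the winning sector
SELECT keywords.*, ts_rank(full_text, keyword_query) AS score
FROM keywords, to_tsvector($1) full_text
WHERE full_text @@ keyword_query
  AND sector = 'IT'   -- the sector found in pass 1
ORDER BY score DESC;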
Suggest new keywords
As mentioned in the beginning, we have a two-tier hierarchy of keywords: each keyword belongs to a “competence” (e.g. “Ruby on Rails” and “RSpec” belong to “Ruby”). To suggest keywords we take the keywords found in the document, go up to their competences, and then take the most relevant N keywords of those competences, ordered by their “search popularity”. A sketch of that query follows below.
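A possible SQL sketch of this suggestion step, assuming hypothetical competence_id and search_relevance columns on the keywords table (our real schema differs; this only illustrates the idea):

-- keywords that share a competence with any keyword matched in the document,
-- ordered by how popular they are with searchers
SELECT DISTINCT sibling.keyword, sibling.search_relevance
FROM to_tsvector($1) full_text,
     keywords matched
     JOIN keywords sibling ON sibling.competence_id = matched.competence_id
WHERE full_text @@ matched.keyword_query
ORDER BY sibling.search_relevance DESC
LIMIT 10;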