Full Text Search using tsvector with Neon Postgres
A step-by-step guide describing how to implement full text search with tsvector in Postgres
The tsvector
type enables you to use full text search on your text content in Postgres. Full text search allows you to search text content in a more flexible way than using LIKE
. Full text search also supports features like stemming, which means searching for the word "run" will match variations like "ran" and "running".
Steps
- Set up a table with a
tsvector
column - Execute your first full text search
- Search for multiple words
- Rank the results
- Create a GIN index
tsvector
column
Set up a table with a To set up full text search, you need to create a column of type tsvector
that will enable full text search. You can run the following CREATE TABLE
statement in the Neon SQL Editor or from a client such as psql that is connected to Neon. This statement will create a table with a column searchable
of type tsvector
.
Next, insert two new rows into the documents
table. The [to_tsvector()] (https://www.postgresql.org/docs/current/textsearch-controls.html) function takes in a language and the text content to tokenize. In the following example, the text content is the title
and body
columns concatenated together.
Once you have inserted the new rows, try running SELECT * FROM documents;
. You will see that the data stored in the searchable
column looks like the following.
Internally, to_tsvector()
uses a parser to break the text content into tokens for easier searching.
Execute your first full text search
You can execute a full text search query using a WHERE
clause with @@ to_tsquery('english', 'content here')
as shown below. The following query returns the "mashed potatoes" row because, although the word "flavorful" does not appear in that row's text content, the word "flavor" does. And "flavor" matches "flavorful".
Similarly, the following query returns the "PostgeSQL" row, even though the word "searching" does not appear in that row's text content, but "search" does.
The @@
operator is a special operator which compares the tsvector
value stored in the searchable
column with the tsquery
value provided in the query. The tsquery
type is different from the tsvector
type. For example, if you run the following command, Postgres will return the string "searching".
Search for multiple words
If you try using to_tsquery()
with multiple words, like to_tsquery('english', 'searching text');
, Postgres will throw the following error.
That's because the input to to_tsquery()
must be tokens separated by tsquery
operators like &
. The correct way to search for "searching" and "text" would be to_tsquery('english', 'searching & text');
. To make full text search easier to work with, Postgres also has a phraseto_tsquery()
function that converts text into a tsquery
with no need for operators. The following query will successfully return the "PostgreSQL" row.
tsquery
also supports negations. For example, the following query will search for rows whose text content matches "searching" and does not match "text".
Postgres also supports a websearch_to_tsquery()
function, which uses an alternative syntax that doesn't require putting operators between all tokens. websearch_to_tsquery()
supports negations by prefixing a token with -
. The following query also searches for rows whose text content matches "searching" and does not match "text".
Rank the results
Postgres provides two functions for ranking the results, allowing you to sort by which results are the best match. The following statement sorts rows that match "searching text" using the ts_rank()
function, which counts the number of tokens that match.
To see how sorting works in practice, insert two more rows as follows. The first row contains 6 tokens that match "search" and "text", so that row should show up first.
Running the ts_rank()
SELECT
statement with these two new rows outputs rows in the following order. "PostgreSQL Text Search" appears first because it has the most occurrences of tokens that match "search" and "text".
Postgres also has a ts_rank_cd()
function which uses an alternative ranking algorithm based on cover density. ts_rank_cd()
also takes proximity of matching tokens into consideration, so the "PostgreSQL Text Search" row will rank slightly lower with ts_rank_cd()
because there's more words between the matching tokens.
Create a GIN index
GIN indexes allow you to index your tsvector
properties, which can make your full text search queries faster as your data grows. Just be careful, GIN indexes can slow down your updates. Below is how you can create a GIN index on the searchable
column.
To test out the GIN index, let's first insert 100 copies of the "mashed potatoes" document. Sometimes Postgres decides to skip using indexes and use a sequential scan instead when a query matches most of the table.
Next, you can run an EXPLAIN ANALYZE
query (or just click the Explain button in the Neon SQL Editor) to confirm that Postgres is using your GIN index.
The EXPLAIN ANALYZE
query should produce output that resembles the following. The Bitmap Index Scan on searchable_idx
means that Postgres is using a GIN index rather than a sequential scan to answer the query.