PG Text Search: Unlock BM25 Ranking in PostgreSQL for AI & RAG Developers | Brav

TL;DR

Table of Contents
  • BM25 ranking is now native to PostgreSQL through PG Text Search.
  • No external search engine needed – everything runs inside your database.
  • Combine keyword search and semantic vectors with PG Vector for a hybrid search.
  • The BM25 index updates automatically as you insert, update, or delete rows.
  • Try PG Text Search on Tiger Data’s free cloud platform today.

Why this matters

I used to rely on Postgres’ built-in full-text search for my RAG prototype. Every time I added a new article, I had to rebuild a manual ranking table. Long documents would swamp the top of the results, and the search would fail if any keyword was missing. For a conversational chatbot, that was a nightmare. PG Text Search solves all those headaches in one extension, so I can focus on building better agents instead of chasing search bugs.

Built-in Postgres full-text search uses tsvector and tsquery types, but it offers only boolean matching and a simple ranking with ts_rank. It lacks key signals such as IDF and length normalization, so the ranking is brittle and long documents often dominate the output. When I migrated to PG Text Search, I instantly saw sharper relevance and the ability to tune k1 and b parameters. Because it runs inside the database, I no longer maintain a separate search engine or sync data between services.

Search quality directly determines the output quality of retrieval-augmented generation (RAG) systems. If the retrieved context is noisy or irrelevant, the large language model tends to hallucinate or give vague answers. A recent study showed that improving retrieval quality increases answer accuracy by up to 30 % in a customer-support chatbot. (Milvus Blog)

Core concepts

Postgres stores searchable text as a tsvector column, which is a sorted list of lexemes. A query is expressed as a tsquery. The database can index the tsvector with a GIN index, giving fast keyword matches. However, ranking is limited to ts_rank or ts_rank_cd, which do not use IDF or length normalization. (PostgreSQL Docs)

The BM25 algorithm

BM25 is a family of ranking functions that combine:

  • IDF (inverse document frequency) – rare terms get higher weight.
  • Term frequency saturation – a document is not over-penalized for repeating a keyword.
  • Length normalization – longer documents are not automatically favored. The formula is: BM25(D,Q) = Σ over terms t in Q of IDF(t) * ((f(t,D)(k1+1)) / (f(t,D)+k1(1-b+b*|D|/avgdl))). (Wikipedia)

PG Text Search implements BM25 natively, giving you a BM25 index that automatically syncs with your table and updates on every INSERT, UPDATE, or DELETE.

PG Vector is a separate extension that stores dense vectors and allows k-nearest-neighbor searches. It is fuzzy: the closer the vector, the higher the similarity score. Vector search gives semantic understanding, but it can be noisy.

PG Text Search can be paired with PG Vector for hybrid retrieval. By running a keyword BM25 query in parallel with a vector similarity filter, you get the precision of keyword search and the coverage of semantic search. TigerData’s documentation explains how to combine them with a simple SQL WHERE clause. (TigerData Docs)

How to apply it

Below is a minimal recipe that I use daily. I spin up a free instance on Tiger Cloud, enable the extensions, and query with BM25. Feel free to copy-paste into your own session.

-- 1. Enable the extensions
CREATE EXTENSION IF NOT EXISTS pg_textsearch;
CREATE EXTENSION IF NOT EXISTS pgvector;  -- optional, for hybrid

-- 2. Create a table
CREATE TABLE documents (
  id bigserial PRIMARY KEY,
  title text,
  body text
);

-- 3. Insert a few demo rows
INSERT INTO documents (title, body) VALUES
  ('Wireless charging', 'Wireless charging allows you to power devices without cables, providing convenience and faster charging times.'),
  ('Gaming keyboard', 'A gaming keyboard features mechanical switches, RGB lighting, and programmable macro keys for competitive gaming.');

-- 4. Create a BM25 index
CREATE INDEX docs_bm25_idx ON documents USING bm25(body) WITH (text_config='english');

-- 5. Query with BM25
SELECT id, title, body,
       body <@> 'wireless charging' AS bm25_score
FROM documents
WHERE body <@> 'wireless charging'
ORDER BY bm25_score ASC
LIMIT 10;

The operator <@> returns a negative BM25 score; the lower (more negative) the value, the better the match.

Adding hybrid semantic scoring

-- Assume we have a vector column ***vec*** in the table
SELECT id, title, body,
       body <@> 'wireless charging' AS bm25_score,
       body <#> vec AS vector_score
FROM documents
WHERE body <@> 'wireless charging'
  AND body <#> vec
ORDER BY bm25_score ASC, vector_score DESC
LIMIT 10;

Running on Tiger Cloud

# Create a free service
tiger service create --name mydb
# Connect via psql
tiger db connect
# Run the SQL script above

The free tier offers 10 GB of storage and stops after a few hours of inactivity. (TigerData Docs)

Comparison table

Search methodWhat it doesLimitation
Postgres native full-textKeyword matching with ts_vector / ts_queryNo BM25 ranking, no IDF or length normalization
PG Text SearchBM25 ranking, automatic index, integrated with SQLRequires extension installation
ElasticsearchExternal engine with powerful ranking and aggregationRequires separate cluster, extra infra

Pitfalls & edge cases

  • Index size: BM25 indexes grow with the number of tokens. For millions of documents, consider partitioning or limiting the indexed columns.
  • Write workload: Each INSERT, UPDATE, or DELETE triggers a reindex. For heavy write traffic, evaluate the impact on latency.
  • Parameter tuning: The default k1 = 1.2 and b = 0.75 work well for general use, but short documents may benefit from a lower k1, while very long documents may need a higher b.
  • Hybrid scoring: When combining BM25 and vector similarity, you must decide how to weight each score. A simple approach is to sort by BM25 first, then by vector similarity, but more advanced blending can improve relevance.
  • Vector quality: The benefit of hybrid search depends on the quality of your embeddings. Low-quality vectors can degrade overall precision.
  • Concurrency: The BM25 index is transactional; concurrent writes are safe, but heavy concurrent reads on a large index can still saturate CPU. Use connection pooling and read replicas if needed.

Quick FAQ

  1. What is BM25 and why does it matter for search quality? BM25 is a probabilistic ranking function that uses IDF, term frequency saturation, and length normalization. It is the industry standard for search ranking, giving more relevant results than plain keyword matching. (Wikipedia)

  2. How does PG Text Search differ from Elasticsearch? PG Text Search runs inside PostgreSQL, so you don’t need a separate search cluster. It offers BM25 ranking and automatic index synchronization, but lacks some of Elasticsearch’s advanced aggregation and faceting features. (TigerData Docs)

  3. Can I use PG Text Search on a self-hosted PostgreSQL? Yes. The extension is open source and can be installed on any PostgreSQL 17 or newer instance. Just run CREATE EXTENSION pg_textsearch;. (Timescale/pg_textsearch)

  4. How do I combine keyword and vector search? Install both pg_textsearch and pgvector, index the text column with BM25, store embeddings in a vector column, and combine the <@> operator with <#> in a single query. (TigerData Docs)

  5. What are the performance implications for very large datasets? BM25 indexing is efficient, but the index size can grow large. For millions of rows, consider partitioning, using fewer indexed columns, or tuning k1 and b. The extension’s documentation suggests monitoring index size and query latency. (Timescale/pg_textsearch)

  6. How do updates or deletions affect the BM25 index? The index updates automatically and transactionally. Every row change is reflected in the index at commit time, so the search results stay consistent without manual maintenance. (Timescale/pg_textsearch)

  7. Is there an SLA for the free tier of TigerData’s cloud service? The free tier offers no charge for idle capacity and stops after a few hours of inactivity, so you’ll need to keep the service running for production workloads. (TigerData Docs)

Conclusion

PG Text Search gives PostgreSQL a modern, BM25-based search engine without leaving the database. For developers building RAG systems, this means sharper retrieval, fewer moving parts, and the ability to experiment with hybrid keyword-plus-semantic search in a single SQL query. Try the free Tiger Cloud instance today, drop your corpus into a table, and watch the relevance climb.

References

  • Timescale/pg_textsearch — PostgreSQL BM25 Extension (2024)
  • TigerData Docs — PG Text Search Extension (2024)
  • TigerData Blog — Introducing pg_textsearch (2024)
  • PostgreSQL Docs — Text Search Types (2024)
  • Wikipedia — Okapi BM25 (2024)
  • Milvus Blog — RAG Retrieval Quality (2024)
  • PostgreSQL News — pgvector 0.5.0 Release (2023)
  • Timescale/tiger-cli — Tiger CLI Extension (2024)
  • TigerData Blog — Five Features of Tiger CLI (2024)
Last updated: January 23, 2026

Recommended Articles

Unlocking the Invisible Internet: How I2P Lets Me Browse Censorship-Resistant Sites | Brav

Unlocking the Invisible Internet: How I2P Lets Me Browse Censorship-Resistant Sites

Learn how to install i2pd, configure LibreWolf, host a private Nginx site, and navigate the invisible internet with step-by-step instructions.
Unlock a 45% Close Rate: Master the Three Buyer Types and Adapt Your Pitch in 5 Minutes | Brav

Unlock a 45% Close Rate: Master the Three Buyer Types and Adapt Your Pitch in 5 Minutes

Learn to spot three buyer types in five minutes and tweak your pitch to lift close rates to 45%, shorten sales cycles, and build trust with HighLevel automation.