s3bw : Database Indexes

Mon 07 July 2025

Database Indexes

Often the most crucial part of a system is the database. Databases manage and handle the storage of our business data, an inefficient database can impact the overall performance of our system. There are, however, a few ways we can improve database performance, one such way is defining appropriate indexes.

This article describes indexes in Postgres and some of the lesser known indexes it uses, such as GIN and BRIN indexes.

Why do we use indexes.

We use indexes in postgres to optimise how we scan the data we have stored in the database. Similar to how binary search can optimise how quickly we find things in a dictionary (see my post on binary search).

We can use an index to inform the database on how to structure data in a way that allows us to find what we are looking for effectively i.e. fewer operations, fewer i/o and fewer comparisons leading to lower CPU usage and better performance.

What is an index.

An index is a data structure which allows us to minimise the number of instructions performed by the database in order to lookup table rows or improve the speed at which a result is sorted.

In the case of sorting if there's an index on a column that you require results to be ordered by, the query can scan the index to return the results in sorted order, if there is no index on the column the data must first be fetched and then sorted in a separate step.

Indexes aren't always good

Over doing the number of indexes on a table can lead to lower write performance as every update or insert into the table requires updating the index. Indexes also require disk space, so having unnecessary indexes will contribute to the rate at which your disk usage grows.

B-Tree

The default index type Postgres uses is a B-Tree. This data structure is similar to a binary tree except instead of having a node with only two pointers a node can contain many pointer. The number of pointers is determined by Postgres's default block size, which is 8kb. A single node will store as many sorted keys as it can until it reaches this size limit. Postgres refers to these nodes as pages.

Having a smaller index keys can improve the performance of your index as you'll be able to store more keys within the page limit. Resulting in fewer i/o instructions as fewer pages are required to be read from disk.

In postgres leaf nodes will point to the physical location of the row on disk and the intermediary nodes (or internal pages) will point to the nodes on the next level down the tree.

There's a neat animation that can help visualise b-trees: https://btree.app/

Partial Indexes

We can optimise the indexes we create by understanding usage patterns and having some intuition about the system and the business. One such optimisation is the use of partial indexes, these are handy when you have a large table but only a subset of the data is used most frequently.

As an example, we can imagine an online order system, it's quite likely that once an order is fulfilled the status transitions to "complete" and this order isn't likely to be accessed as frequently as the orders that are still in progress. We can partition our index so that it contains only unfulfilled orders which will be a significantly smaller portion of our overall orders.

CREATE INDEX idx_uncomplete_orders_customer ON
orders(customer_id) WHERE status != 'complete';

We also have to include this WHERE filter in our queries if we wish to make use of this index.

Covering Index

A covering index allows us to add columns to the index's leaf nodes and thus avoids the lookup and reading of the entire the table row from disk. Essentially everything that the query needs is available on the index, reducing the number of operations required to complete the query.

As an example if we typically request a user's first name and last name with an email we can write an index on the email that includes the first and last name.

CREATE INDEX idx_user_email_include
ON users (email) INCLUDE (firstname, lastname);

We have to bare in mind that we should only include columns which change as frequently or less frequently than the key we are indexes otherwise we are duplicating data and increasing the write overhead. This isn't an issue for columns that rarely change.

Gin Index

GIN stands for Generalised Inverted Index, similar to the inverted index that Lucene is built on, the same Lucene that Elasticsearch is built on. The slight difference is that this inverted index is generalised so it expects items that are indexed to be composed of multiple values the same way a sentence is composed of multiple words, with the goal to support full-text search.

An index entry is created for each element in the composite column and the entry points to a list of matching locations. So as an example a row that has a column with the value "Here's is a blog post" will create an index entry for each word in the value, (e.g. "blog") and then each entry will point to the rows that contain "blog" inside the composite column.¹

ALTER TABLE blogposts
ADD COLUMN search_vector tsvector;

CREATE INDEX idx_blogposts_search ON blogposts
USING GIN (search_vector);

GIN indexes aren't limited to just text, they are generalised so they can be used with any composite type, JSONB for example.

BRIN Index

BRIN stands for Block Range Index. These are indexes that specialise in handling very large tables in which certain columns have some natural correlation to their physical storage location. Essentially if each row in the table can be grouped in some manner to the blocks on disk then we have a potential use-case for the BRIN Index.

GPS tracking points and log tables are good examples of data with this natural correlation, if you have a table that is small or data is updated or inserted out of chronological order then they won't form a good fit for BRIN indexes.

Instead of storing pointers to entry locations in the B-Tree leaf nodes, BRIN indexes store summary information for the ranges that the blocks on disk are ordered. This allows for a smaller overhead compared to the default B-Tree which stores all entries and their corresponding locations.

CREATE INDEX idx_transactions_created_at_brin
ON transactions using BRIN (created_at);

This can be used to optimise queries that rely on fetching rows between certain ranges.

SELECT * FROM orders WHERE
created_at BETWEEN '2024-01-01' AND '2024-12-31';

I'm describing a reverse index. ↩