Mon 30 June 2025

Low Latency Computation

Anything that responds within roughly 200ms feels instantaneous to humans; anything slower and we perceive the delay. Some systems are built to respond far quicker than this: business costs can be sensitive to latency, and every millisecond can make a difference.

How do we develop highly performant systems? I happened to share a train ride with Mark Shannon, who at the time was leading a team at Microsoft that had one goal: Make Python Fast.

I asked him: how does one make code more performant? To which he responded:

Make it do less.

Here's what doing less looks like:

Loop Invariant

A simple example of doing less in order to achieve better performance is hoisting loop-invariant computation out of a loop.

We loop over arrays and call other functions inside those loops. During a refactor, or while moving code around, we might not realise that some of the values we compute inside the loop are constant.

def expensive(trades: list[Trade], n: int):
    rolling_sum = 0
    for trade in trades:
        beta = compute_beta(n)  # recomputed every iteration, yet n never changes
        rolling_sum += beta * trade.risk

    return rolling_sum / len(trades)

You'll notice in the example that I am looping over the list trades and computing a beta value, but this beta value is computed from the integer n, which is constant within the scope of this function. If we move the computation of beta to a point before we enter the loop, we avoid recomputing beta for each trade in trades.

def expensive(trades: list[Trade], n: int):
    rolling_sum = 0
    beta = compute_beta(n)  # hoisted: computed once, outside the loop
    for trade in trades:
        rolling_sum += beta * trade.risk

    return rolling_sum / len(trades)

Now beta is computed at most once within this method's scope; we have essentially done less. If compute_beta incurs an expensive overhead and the trades list is long, this subtle oversight can have a significant impact on our latency.

Python

There are several layers of abstraction in Python, and there are ways we can write code that make better use of the underlying C implementation. One such example: using a list comprehension over a vanilla for-loop.

If we wish to understand the number of steps the machine performs for a method, Python lets us disassemble the bytecode with the dis module. You might notice additional LOAD_ATTR (or, on some Python versions, LOAD_METHOD) instructions in a vanilla for-loop that appends to a list; defaulting to a list comprehension avoids this per-iteration overhead and allows us to do less.
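
To make this concrete, here is a minimal sketch (the function names are mine, purely illustrative) comparing the two forms. Running dis.dis on each shows the loop version looking up the append attribute on every iteration, while the comprehension appends with a dedicated LIST_APPEND instruction; the exact opcodes vary between Python versions.

```python
import dis

def squares_loop(nums):
    out = []
    for n in nums:
        # each iteration looks up out.append (LOAD_ATTR/LOAD_METHOD)
        # before making the call
        out.append(n * n)
    return out

def squares_comp(nums):
    # the comprehension appends via LIST_APPEND, skipping the
    # per-iteration attribute lookup
    return [n * n for n in nums]

# Compare the instruction streams:
# dis.dis(squares_loop)
# dis.dis(squares_comp)
```

Both functions return the same result; the difference is only in how many bytecode instructions each iteration costs.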

Databases

Relying on external systems can be expensive; in some cases we might query the same system several times in order to complete a single operation. If we have a distributed system, or intend to scale our server horizontally, we increase the number of things opening and closing connections to the database. Each open/close costs CPU instructions, and this overhead can be significant in low latency environments. To get around this we introduce a connection pool between our server and the database: the pool manages long-lived connections, freeing the database to focus on database things.
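
The idea can be sketched in a few lines; this toy pool uses sqlite3 and a Queue purely for illustration (production systems would reach for something like PgBouncer or their ORM's built-in pooling):

```python
import sqlite3
from contextlib import contextmanager
from queue import Queue

class ConnectionPool:
    """Toy pool: connections are opened once at startup and reused,
    instead of being opened and closed per request."""

    def __init__(self, db_path: str, size: int = 5):
        self._pool: Queue = Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets connections be shared
            # across threads in this sketch
            self._pool.put(sqlite3.connect(db_path, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._pool.get()  # blocks until a connection is free
        try:
            yield conn
        finally:
            self._pool.put(conn)  # return to the pool rather than close
```

A request handler then borrows a connection with `with pool.connection() as conn: ...` and pays no open/close cost on the hot path.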

A common anti-pattern involving a database is having your ORM make N+1 queries. This occurs with a one-to-many entity relationship. If we ask the database to return a group of N countries, and then for each country ask for all of its cities, we make N (one per country in our initial result) + 1 (the initial query) total requests to the database. These can be tricky to spot, since each individual query may be performant, but the accumulated overhead and latency of all these queries can cause problems.

We get around this either by asking for all countries and then asking for all cities given the set of country ids, performing the mapping in memory on the server, or by increasing the complexity of the query with a JOIN and letting the database do the mapping. Either way we avoid the back-and-forth overhead of N+1 queries, effectively doing less.
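
Here's the countries/cities example sketched with raw sqlite3 (the schema and data are made up for illustration) showing the N+1 shape next to the single-JOIN alternative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT,
                         country_id INTEGER REFERENCES countries(id));
    INSERT INTO countries VALUES (1, 'France'), (2, 'Japan');
    INSERT INTO cities VALUES (1, 'Paris', 1), (2, 'Lyon', 1), (3, 'Tokyo', 2);
""")

def cities_n_plus_one():
    # 1 query for countries, then 1 query per country: N+1 round trips
    result = {}
    for cid, name in conn.execute("SELECT id, name FROM countries"):
        rows = conn.execute(
            "SELECT name FROM cities WHERE country_id = ?", (cid,))
        result[name] = [r[0] for r in rows]
    return result

def cities_joined():
    # one round trip; the database performs the mapping via the JOIN
    result = {}
    rows = conn.execute("""
        SELECT countries.name, cities.name
        FROM countries JOIN cities ON cities.country_id = countries.id
    """)
    for country, city in rows:
        result.setdefault(country, []).append(city)
    return result
```

With an ORM the N+1 version is easy to write by accident (lazy-loading a relationship inside a loop); the JOIN version corresponds to eager loading.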

When we can't do less

Sometimes we can't do less, but we can decide when the work is done. Some situations allow us to precompute results, providing performant computation at the cost of a slower start time.

Chess programming is an area that relies on high performance: the quicker our program, the further into the future we can look on each move during the game. One way we speed up a chess engine is by precomputing the moveset of every piece from every square it might occupy, before the game starts. During the game we then look up a piece's possible moves instead of computing them only when we need them.1
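
As an illustration of the idea (a sketch, not an excerpt from any particular engine), a knight's moves from every square can be tabulated once at startup, turning move generation into a list index:

```python
# All eight (rank, file) offsets a knight can move by
KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2),
                  (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def knight_moves(square: int) -> list[int]:
    """Compute the legal destination squares for a knight on an
    empty board, with squares numbered 0..63 (a1 = 0, h8 = 63)."""
    rank, file = divmod(square, 8)
    moves = []
    for dr, df in KNIGHT_OFFSETS:
        r, f = rank + dr, file + df
        if 0 <= r < 8 and 0 <= f < 8:  # stay on the board
            moves.append(r * 8 + f)
    return moves

# Built once before the game; during search, KNIGHT_TABLE[square]
# is a plain lookup instead of a computation.
KNIGHT_TABLE = [knight_moves(sq) for sq in range(64)]
```

Real engines take this further with bitboards and precomputed attack tables for the sliding pieces, but the trade is the same: pay at startup, look up during the game.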

In 3b1b's wordle solver, he precomputes a pattern matrix for each word and saves it to a file, avoiding that overhead on every execution of the solver.

Constant Folding

Lastly, there's an interesting optimisation that compilers make, also used by Python, called constant folding.

When Python is interpreted it parses the AST before executing. During this stage it finds expressions whose outcome is constant and computes the constant once, instead of evaluating the expression every time it is referenced. As an example, the expression x = 24 * 60 is converted to x = 1440 before the bytecode instructions are compiled for the machine.
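
We can observe this directly: in CPython the folded constant ends up in the function's code object, so no multiplication happens at run time (the exact opcode shown by dis varies by version):

```python
import dis

def minutes_per_day():
    return 24 * 60  # folded to 1440 by CPython at compile time

# dis.dis(minutes_per_day) shows the literal 1440 already folded,
# and the constant is stored in the code object itself.
print(1440 in minutes_per_day.__code__.co_consts)
```
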

So in all cases, if we want a computer to be highly performant we just have to make it do less.

Further Reading

  • perflint — try running this on your Python codebase to check for performance optimisations.

  1. More on this will be explained when I write my second post on chess programming. 
