Using ULIDs at incident.io

Inspired by PlanetScale’s post about Why we chose NanoIDs for PlanetScale’s API, I wanted to share my experience of using ULIDs (Universally Unique Lexicographically Sortable Identifier) at incident.io.

It’s worth noting ULIDs had been chosen before I ever arrived at incident, so I can’t speak to the motivation. But I think our implementation will be the most common, especially for those using Postgres, so our experience can be useful to people trying similar things.

What is a ULID?

ULIDs are an alternative to the official UUID formats and while they look similar, are constructed to avoid several pitfalls of a standard UUID.

An example ULID is: 01ARZ3NDEKTSV4RRFFQ69G5FAV.

Borrowing the diagram from the official ULID spec, it is constructed by concatenating a timestamp with a random suffix:

 01ARZ3NDEK      TSV4RRFFQ69G5FAV

|----------|    |----------------|
 Timestamp          Randomness
   48bits             80bits

This provides several nice properties:

ULIDs are more compact and easy to copy-and-paste, unlike hyphenated UUIDs (e.g. 04dbe983-a9af-4776-9ca9-fd10e5a80636).
Timestamp prefixing means ULIDs are Lexicographically sortable (more of this later).
The random component has enough entropy to avoid collisions in practical uses.

In addition to those benefits, a big reason to adopt ULIDs was as an alternative to UUIDs in relational databases, which could case performance issues (see 1, 2).

Specifically, when using the default B-tree index implementations in most relational databases, UUIDs which sort randomly could cause index fragmentation which leads to index bloat. The fragmentation means temporally related rows would be spread across the index impacting query performance, while bloat increases the size of the index on disk for the same amount of entries.

You could try working around this with hash indexes which aren’t ordinally sensitive, but support was limited. Using Postgres as an example, it wasn’t until Postgres 10 (2017) that hash indexes were replicated, which ruled out uses in any clustered setup.

So lot’s of advantages, but what’s it like using them in production?

How we use ULIDs

At incident.io, we vendor geckoboard/pgulid to add a generate_ulid() function to our database schema.

Then we create most of our tables like this:

create table incidents (
  id text primary key default generate_ulid() not null,
  organisation_id text not null references organisations(id),
  name text not null,
  ...
)

Which means inserting a row into incidents will have Postgres call the generate_ulid() function to provide an ID, ensuring all ID generation happens via this function and on the database server, rather than client.

I tend to prefer generating all IDs serverside, as – even built to the official spec – every ULID generator (say the Go or JavaScript libraries) will behave slightly differently, and it feels nicer that all our IDs are generated from the same function.

For this reason, what follows will focus on the specifics of the pgulid implementation, and how ULIDs behave when used this way in Postgres. YMMV if any of these specifics don’t apply, so be warned!

ULIDs are the primary key

From the SQL example creating the incidents table, you’ll note that our ULIDs are the database primary key, rather than using a ULID alongside an internal bigint/etc key.

Using the same ID internally as you do externally can avoid a layer of complexity, in that you don’t have two possible IDs that might refer to the same row. Requiring developers to think about two IDs jointly and provide some level of translation between external/internal in your queries complicates everything. And if bugs around IDs can be avoided, I’d very much like to!

Sorting is great

There’s also a practical advantage to having tables keyed by ULIDs that relies on ULIDs being lexicographically sortable wrt the time they were generated.

That’s a big word for a simple concept, which is that the first part of a ULID is the timestamp, so as you create successive ULIDs they will sort in ascending order by default.

It means:

expect([
  generate_ulid(), # first
  generate_ulid(), # second
  generate_ulid(), # third
]).to be_sorted

Will pass, because one < two, and two < three.

This is cool because it means:

select id, name from incidents
where organisation_id = '<org>'
order by id;

Will return incidents in the approximate order that they were created, which is the order most users would want to receive results. And provided you create the following index:

create index idx_incidents_organisation_id_id
on incidents using btree (organisation_id, id);

Which we do for most of our tables, then Postgres can execute the query through an efficient index scan. In addition, if you’re cursor paginating such as when powering an API which returns results batches, the query you make to get the next batch:

select id, name from incidents
where organisation_id = '<org>' and id > '<last-seen-id-cursor>'
order by id
limit 25;

Can restart from where you left off, by quickly finding the (organisation_id, id) location in the idx_incidents_organisation_id_id index.

There are caveats to using cursor pagination like this – such as missing rows if they’re in uncommitted transactions when you page past them – but on the whole, default chronological ordering is a useful crutch.

Drawbacks

So with a very basic implementation of ULIDs, everything works and some aspects are pretty nice. But there are drawbacks, even if most are due to our choices rather than ULIDs themselves.

Storage and representation

The geckoboard/pgulid implementation makes a design decision that impacts everywhere ULIDs are used in the database.

Let’s take a look at the generate_ulid() implementation:

CREATE FUNCTION generate_ulid()
RETURNS TEXT -- <<< ?!?
AS $$
  ...
$$
LANGUAGE plpgsql
VOLATILE;

The function returns TEXT, which is the Postgres datatype for variable-length strings, approximately 1 byte per character.

This is a bit problemtic, because ULIDs (as with UUIDs) are 128-bit identifiers, which is 128 / 8 = 16 bytes. But generate_ulid() is producing them in their Crockford base32 string form, which is 26 characters long.

That means we’re inflating our ULIDs by 26 / 16 ≈ 60% when storing them in Postgres, when compared to storing them via the Postgres native UUID datatype.

Consequently:

Each row we store in the database is larger, costing us disk space.
We increase the size of indexes that include ULID IDs, making index scans to read more data from disk.
Joining tables requires building query-local data structures such as hash tables, and larger join IDs require more memory, causing queries to spill onto on-disk temporary buffers.

This is… kinda fine for now, but having ran large Postgres databases before, the difference using smaller join keys can make is significant. We’ll likely pay for this one day, but hopefully far in the future.

Worth saying this is fixable by reimplementing generate_ulid() to return a UUID, but would require additional helpers to format UUIDs in the base32 we’ve been using. So not an easy change now we’re up and running with the text IDs.

Gotchas! around sort order

As I mentioned in “Sorting is great”, there are caveats to ULIDs being sortable. One of them is, awkwardly, that people end up relying on them being sortable.

This cropped up the other day, when an incident.io engineer got confused about why their pagination code was clearly working when hitting their API, but their tests were failing.

The answer for this is in the implementation of generate_ulid():

CREATE FUNCTION generate_ulid()
RETURNS TEXT
AS $$
  ...
  unix_time = (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT;
  ...
$$
LANGUAGE plpgsql
VOLATILE;

As we know, ULIDs are prefixed with a time, which is what makes the orderable. But what you might not realise is that the Postgres function now() behaves differently in and out of transactions.

Outside a transaction:

development=# select now(); select pg_sleep(1); select now();
2022-12-30 14:38:52.639301+00
-
2022-12-30 14:38:53.670078+00

In a transaction:

development=# begin; select now(); select pg_sleep(1); select now(); commit;
2022-12-30 14:39:05.211053+00
-
2022-12-30 14:39:05.211053+00

Notice the time is exactly the same when executing within a transaction, because Postgres freezes the value of now() for subsequent calls in the same transaction (clock_timestamp() is an alternative that gives real time), causing everything to look like it happens at the same time.

For the poor engineer writing tests, this meant their IDs sorted randomly every time they ran the tests, and would pass/fail depending on whether the random suffix sorted correctly for their test. Not fun, quite subtle, easily missed.

Performance

Finally, performance of the generate_ulid() function leaves a lot on the table.¹

generate_ulid() is implemented in PL/pgSQL, the Postgres procedural scripting langauge. While quite performant, writing a script in PL/pgSQL won’t compare to a native implementation, and generate_ulid() does several string operations as well as calling pg_crypto helpers to fetch randomness.

Benchmarking this on our production database:

production::DATABASE=> explain analyze select generate_ulid(), * from generate_series(1,100000);
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..5300.00 rows=100000 width=36) (actual time=7.881..2303.932 rows=100000 loops=1)
 Planning Time: 0.015 ms
 Execution Time: 2312.002 ms
(3 rows)

That’s generating 100k ULIDs, so 43k/s. That doesn’t seem crazy, but let’s compare this to a native Postgres UUID generator:

production::DATABASE=> explain analyze select uuid_generate_v4(), * from generate_series(1,100000);
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series  (cost=0.00..350.00 rows=100000 width=20) (actual time=23.726..1012.402 rows=100000 loops=1)
 Planning Time: 0.017 ms
 Execution Time: 1019.302 ms
(3 rows)

Allowing us to generate UUID v4s at 98k/s, about 2.3x as fast as we can generate ULIDs. And if you repeat this with bigserial sequence based counters, you’ll find it’s about 10x as fast as the ULID generation.

You won’t notice this in standard OLTP work, but if you’re batch inserting data into Postgres (as we do quite a bit) then it adds up, and applies a speed limit to the number of rows you can create before you even think about the work required to insert it.

Not a big deal, but one to look out for.

The grass is always greener

So a mixed bag, in terms of the practical experience using ULIDs. While it was easy to setup and has support in almost every language/platform, you shouldn’t expect each implementation to be optimal, and you might need to look closely to catch the rough edges.

If you’re picking an ID format now, I’d advise avoiding ULIDs. Given UUID v7 is so widely used and provides many of the same benefits, it feels prudent to use what will become the new standard rather than an offshoot.

Equally, consider whether you really need universally unique IDs for the system you’re building. My favourite ID system was used at GoCardless, where we used a database sequence to provide similar lexicographical properties while adding human readable prefixes for ergonomics, say PM123 for payments and MD123 for mandates. Very few trade-offs there, and lots of advantages.

Whatever you do, good luck. Nothing is perfect, but with YAGNI as your guide, you can’t go too far wrong.

We found a lot of this when digging into Postgres performance issues the other week, and started looking closely at the ULID implementation. At one point we were questioning the use of get_random_bytes which is cryptographically secure and thus dependent on machine entropy, which might run out.

I’m happy to share our production database can generate about 200MB/s of random bytes, so that’s not a problem in practice. ↩

If you liked this post and want to see more, follow me at @lawrjones.