Tokid and the Encodings We Forgot to Measure
Token-native identifiers, and the encodings worth measuring next
A UUID v4 is 36 characters: 32 hex digits (122 of its 128 bits are random) plus four hyphens. OpenAI's tokenizers, along with Claude's and Gemini's, use Byte-Pair Encoding: their vocabularies are built by merging the most frequent byte pairs into single tokens. Random hex has no frequent pairs to merge, so a UUID gets sliced into roughly 18 to 22 tokens of essentially incompressible noise. (The two tokenizers I will cite below are cl100k_base, used by GPT-3.5 and GPT-4, with about 100,000 entries; and o200k_base, used by GPT-4o, with about 200,000.)
Many of those entries are full English words. Eight words, each chosen to be a single token in both tokenizers, give you ~84 bits of address space at 8 tokens of cost. That is enough for nearly anything that is not a cryptographic key.
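Both claims are easy to check locally with tiktoken. A minimal sketch; exact counts vary with the particular UUID drawn, and the table below reports 18 for this article's example:

```python
import uuid
import tiktoken

words = "straight course shirt height alter outer rapid verse"
u = str(uuid.uuid4())

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    # The article's example measures 8 tokens for the words, ~18 for a UUID.
    print(f"{name}: uuid={len(enc.encode(u))} words={len(enc.encode(words))}")
```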
I built one library that does this for identifiers. It is called tokid. It exists because Garrett asked the question that bothered me for a week: how much of our context window are we burning on identifiers nobody reads?
I am confident there are a dozen more encodings worth this treatment. Below is how tokid was built, what is in it, and where the rest of the family goes.
One ID, three jobs
Tokid optimizes one logical ID rendered three ways. Each rendering is the right answer for a different job — model context, structured transport, durable storage.
prompt: straight course shirt height alter outer rapid verse
transport: straightcourseshirtheightalterouterrapidverse
envelope: tk1_oa1_straightcourseshirtheightalterouterrapidverse_1oze8

Measured locally for that example:
form        chars   cl100k_base   o200k_base
prompt      52      8             8
transport   45      12            10
envelope    59      22            21
uuid_v4     36      18            18
Each word in the prompt form is one atom. The three forms are renderings of the same eight atoms.
The prompt form is what an LLM reads inside natural text: eight tokens, one per word. The transport form is what URLs and JSON values carry, where space-separated text costs more tokens once the spaces are escaped (benchmarks below). The envelope is what you store. It carries a profile tag and a checksum so the payload stays self-describing across releases and processes.
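A minimal sketch of how the three forms relate, using the atoms from the example above. The envelope framing is schematic: the tag and checksum strings are copied from the example, and reading "tk1" as a version tag and "oa1" as the profile tag is my assumption; the real checksum derivation is not specified in this post.

```python
atoms = ["straight", "course", "shirt", "height",
         "alter", "outer", "rapid", "verse"]

prompt = " ".join(atoms)    # model-facing: one vocab token per word
transport = "".join(atoms)  # delimiter-free; a prefix-free wordlist keeps it decodable
envelope = f"tk1_oa1_{transport}_1oze8"  # schematic framing, values from the example
```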
How it was built
Vocabulary selection
Start from the tokenizer vocabulary. Keep only words that tokenize as a single token in both cl100k_base and o200k_base. One of the two shipped profiles uses no separator at all between atoms — for that one, filter further to keep the wordlist prefix-free, so a trie can decode a concatenated payload uniquely. The size of the surviving wordlist sets bits-per-atom.
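A sketch of that selection pass with tiktoken. The space-prefixed check and the shortest-first prefix filter are my assumptions about one workable strategy, not necessarily tokid's exact pipeline:

```python
import tiktoken

encs = [tiktoken.get_encoding("cl100k_base"), tiktoken.get_encoding("o200k_base")]
cl = encs[0]

# Candidates: lowercase alphabetic entries in the cl100k_base vocabulary.
candidates = set()
for tid in range(cl.n_vocab):
    try:
        w = cl.decode_single_token_bytes(tid).decode("ascii").strip()
    except (UnicodeDecodeError, KeyError):
        continue  # skip byte fragments and special tokens
    if w.isalpha() and w.islower():
        candidates.add(w)

# Keep words that are one token in BOTH tokenizers. Checking the
# space-prefixed form too (an assumption) is what makes the prompt form
# cost exactly one token per word after the first.
single = [w for w in sorted(candidates)
          if all(len(e.encode(w)) == 1 and len(e.encode(" " + w)) == 1
                 for e in encs)]

# Prefix-free filter for the no-separator profile: greedy, shortest first.
prefix_free = []
for w in sorted(single, key=len):
    if not any(w.startswith(p) for p in prefix_free):
        prefix_free.append(w)
```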
A prefix-free trie for transport decoding
Once words cannot share prefixes, a greedy trie walk over the raw concatenated payload recovers the atoms uniquely. This is what makes the transport form survive without delimiters inside JSON values and URL paths.
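A minimal trie decoder, assuming a prefix-free wordlist. Because no kept word is a prefix of another, the first terminal reached during the walk is the only possible match:

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w  # terminal marker
    return root

def decode(payload, root):
    atoms, i = [], 0
    while i < len(payload):
        node, j, word = root, i, None
        while j < len(payload) and payload[j] in node:
            node = node[payload[j]]
            j += 1
            if "$" in node:       # prefix-free: first terminal is unique
                word = node["$"]
                break
        if word is None:
            raise ValueError(f"undecodable payload at offset {i}")
        atoms.append(word)
        i = j
    return atoms

trie = build_trie(["straight", "course", "shirt", "height",
                   "alter", "outer", "rapid", "verse"])
print(decode("straightcourseshirt", trie))  # ['straight', 'course', 'shirt']
```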
Transport-context benchmarks
Same payload, every reasonable delimiter, across the transport contexts where IDs actually land. At 8 atoms embedded in a url_path, spaces averaged 30.86 tokens, raw concatenation averaged 18.35, and underscores averaged 20.20. Raw concatenation won in most contexts that escape spaces; hyphens and dots fared worse than underscores in the same study. The two shipped profiles are space-prompt with raw-concat transport (openai-cross-v1) and space-prompt with underscore transport (openai-cross-underscore-v1).
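The harness for that kind of measurement is small. This sketch runs one payload through one context, a hypothetical URL path on api.example.com, rather than the full study behind the averages above:

```python
import tiktoken

atoms = "straight course shirt height alter outer rapid verse".split()
encs = {n: tiktoken.get_encoding(n) for n in ("cl100k_base", "o200k_base")}

for dname, sep in [("space", " "), ("raw", ""), ("underscore", "_"),
                   ("hyphen", "-"), ("dot", ".")]:
    payload = sep.join(atoms).replace(" ", "%20")  # URL paths escape spaces
    url = f"https://api.example.com/runs/{payload}"  # hypothetical route
    counts = {n: len(e.encode(url)) for n, e in encs.items()}
    print(f"{dname:<10} {counts}")
```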
Portable profile manifests and cross-language conformance
The runtime contract is a JSON manifest plus a shared fixture suite. SDKs pass the suite to claim parity. Four live SDKs ship today (JS, Python, Go, Rust); two more are pending registry signoff (Java, C#). They share a manifest and a test fixture, not code.
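The manifest schema and fixture format are not published in this post, so everything below is invented to illustrate the shape of the contract: a manifest pins the profile, and an SDK claims parity by reproducing every fixture byte for byte.

```python
# All field names here are hypothetical.
MANIFEST = {
    "profile": "openai-cross-v1",
    "tokenizers": ["cl100k_base", "o200k_base"],
    "atoms": 8,
    "prompt_separator": " ",
    "transport_separator": "",
    "wordlist_sha256": "<hash pinning the exact wordlist>",
}

FIXTURES = [
    {"atoms": ["straight", "course", "shirt", "height",
               "alter", "outer", "rapid", "verse"],
     "prompt": "straight course shirt height alter outer rapid verse",
     "transport": "straightcourseshirtheightalterouterrapidverse"},
]

def check(render_prompt, render_transport):
    # A conforming SDK must reproduce every fixture exactly.
    for f in FIXTURES:
        assert render_prompt(f["atoms"]) == f["prompt"]
        assert render_transport(f["atoms"]) == f["transport"]

check(" ".join, "".join)  # both renderings are pure joins over the atoms
```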
The 8-atom prefix-free profile carries ~84 bits of entropy. UUID v4 carries 122 random bits. For non-security identifiers — trace IDs, run IDs, document IDs — the gap costs nothing and saves ~10 tokens per occurrence inside model context.
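The arithmetic behind those numbers, taking the article's ~84-bit figure as given:

```python
bits_total = 84
bits_per_atom = bits_total / 8      # 10.5 bits per word
print(round(2 ** bits_per_atom))    # implies a wordlist of ~1448 surviving words

# Birthday bound: ~50% collision odds arrive near 2**(bits/2) identifiers,
# about 4.4e12 IDs, far beyond any non-security identifier workload.
print(f"{2 ** (bits_total / 2):.2e}")
```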
What it is not
Tokid is not a UUID replacement. It does not sort by time. It is not an auth token, a bearer secret, or a tamper-proof capability URL. It is longer in characters than nanoid. It is the wrong tool for backend primary keys the model never reads.
It is the right tool for the case where identifiers regularly cross a prompt, a tool call, or an LLM-readable transport and the token cost on that surface matters.
The encodings we forgot to measure
IDs are one case. The same study — measure the encoding against the tokenizer, pick the option that costs the fewest tokens — works for anything else opaque that the model has to read.
Three obvious neighbors:
URLs. A URL has a path, a query string, an embedded ID, and ceremony: https://, trailing slashes, redundant parameters. A token-native URL library (tokurl is the obvious name) would do for routes what tokid does for identifiers: vocab-token slugs inside the path, plus stripping the parts the model does not need to round-trip.
Hex, base64, MAC addresses, hashes, color codes. These share UUIDs' problem: dense uniform-distribution bytes that BPE has no useful merges for. A single library could encode any of them into vocab tokens; a sketch follows this list. Same trie. Same profile manifest shape.
Structured data. TOON already lives in this lane, claiming 30-60% reduction on JSON’s structural overhead. It is the existence proof that the territory is real.
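The core of that hypothetical bytes-to-words library is one base conversion, sketched here under the same prefix-free-wordlist assumption. Note the naive integer round-trip drops leading zero bytes, so a real encoder would add length framing and a checksum, as tokid's envelope does:

```python
def bytes_to_atoms(data: bytes, wordlist: list[str]) -> list[str]:
    # Re-express the bytes in base len(wordlist), one vocab word per digit.
    n = int.from_bytes(data, "big")
    base, atoms = len(wordlist), []
    while n:
        n, r = divmod(n, base)
        atoms.append(wordlist[r])
    return atoms[::-1] or [wordlist[0]]

def atoms_to_bytes(atoms: list[str], wordlist: list[str], length: int) -> bytes:
    # Inverse walk; `length` is the framing a real library would carry.
    index = {w: i for i, w in enumerate(wordlist)}
    n = 0
    for w in atoms:
        n = n * len(wordlist) + index[w]
    return n.to_bytes(length, "big")
```

Hex, base64, MACs, hashes, and color codes all reduce to raw bytes, so the same two functions cover every item on that list; only the framing differs.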
Other candidates are sketchier and need empirical work before they are libraries. Timestamps may already tokenize cheaply as epoch seconds; ISO 8601 almost certainly does not. Pagination cursors, error response bodies, and CSS-in-JS class names each share the same character — a string the model has to read, an encoding chosen for a different consumer, a measurable token tax.
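The timestamp claim, at least, is a one-minute measurement. This sketch checks a single instant; a real pass would average over a corpus:

```python
import datetime
import time
import tiktoken

now = int(time.time())
iso = datetime.datetime.fromtimestamp(now, datetime.timezone.utc).isoformat()

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: epoch={len(enc.encode(str(now)))} "
          f"iso8601={len(enc.encode(iso))}")
```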
Most encodings we use were not designed for the consumer they have now. The gap between what the wire wants and what the tokenizer rewards is large enough to be worth measuring. Tokid is the version of that argument I had time to ship.
Repo
tokid. ISC, alpha, four live SDK channels. Two profiles, both OpenAI-tokenizer-derived. The honest gap is per-tokenizer profiles for Anthropic and Gemini, and that is the next measurement pass.
Garrett gets full credit for pondering this in a jam session.

