Output Token Order is a Hidden Variable in LLM Response Quality and Accuracy
Two lessons from production
Every output token an LLM generates depends on all preceding tokens. This includes previously generated output tokens.
Below, $x_t$ is the token the LLM is currently generating. Each preceding token, $x_1$ through $x_{t-1}$, can be an input token provided by the user, or an output token generated earlier in this run:

$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

As soon as an output token is generated, it becomes part of the context that determines the probability of the next token.
This implies that you can steer the LLM to better outputs simply by front-loading output tokens that will have an impact on later tokens. These front-loaded outputs anchor the LLM and help shrink the universe of possible subsequent output tokens.
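That anchoring effect falls directly out of the standard autoregressive factorization (a general fact about how these models decode, not anything specific to one model or vendor): the probability of a full output is the product of per-token conditionals, so any token placed early appears in the conditioning context of every token after it.

$$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$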
I've found two ways of leveraging this; I'm confident there are dozens more.
Below are short explanations of both, and my anecdotal evidence.
JSON Field Ordering
Let's say we're building a voice agent that speaks with users over the phone.
We have a prompt that evaluates the transcript and decides when we should end the call. If we are going to end the call, is_relevant is true, and then we'll reply with the generated response.
{
  is_relevant: boolean,
  response: string
}
Because the LLM generates the is_relevant field before the response field, the relevancy decision is baked into the response. That wouldn't be true if I reordered the fields:
{
  response: string,
  is_relevant: boolean
}
Here, the response would be generated first, and the relevancy judgment would be influenced by whatever response the model happened to write, rather than the other way around.
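One way to make the ordering concrete in code is to embed an example payload in the system prompt. This is a minimal sketch, not our production code; the type names and prompt text are hypothetical. It leans on the fact that JavaScript objects preserve string-key insertion order, so JSON.stringify emits is_relevant before response.

// Decision-first payload: the boolean is generated before the reply,
// so the reply is conditioned on the end-call decision.
type RelevancePayload = {
  is_relevant: boolean; // generated first, anchors everything below
  response: string;     // written with the decision already fixed
};

// Hypothetical example output to embed in the system prompt. Key order
// survives JSON.stringify, so the model sees is_relevant before response.
const exampleOutput: RelevancePayload = {
  is_relevant: true,
  response: "Thanks for calling! Have a great day.",
};

const systemPrompt = [
  "Decide whether the call should end, then write the reply.",
  "Respond with JSON shaped exactly like this example:",
  JSON.stringify(exampleOutput, null, 2),
].join("\n");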
JSON Field Names
In the same response format for our end-call prompt, we have an opportunity to eke out slightly more efficiency.
Instead of having generic field names, replace them with more targeted names.
{
  end_call: boolean,
  contextual_response: string
}
Because each field name is generated immediately before its value, a more targeted name sways the value toward being more targeted as well.
Anecdotal Outcomes
Simple JSON Payloads
The example prompt for ending a call is a simplified version of a prompt we use in production. We have similar prompts for forwarding calls, supervising behavior, deciding whether the conversation strays too far from allowed topics, and so on.
When we first started writing these prompts, all of them ran on is_relevant as a standard field. We'd give 10-15 demonstrations of relevancy and explain that is_relevant means "choosing to end the call is relevant," along with other instructions to make that clear.
Nothing we tried in our system prompt gave us as much of a gain in accuracy as renaming is_relevant to end_call, forward_call, and topic_is_out_of_bounds. That clarity, added right before the boolean was decided, easily added 5-10% accuracy on our prompts. The compounded effect of each individual prompt gaining 5-10% accuracy made the system as a whole more predictable and reliable.
There are several edge cases we discovered where both the end call and forward call prompts triggered at the same time, or where call forwarding would trigger when you'd expect end call to trigger instead. These edge cases fell off significantly with the more specific naming convention. Having a more targeted field name meaningfully shrunk the universe of what the LLM considered relevant.
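Concretely, the renamed payloads end up looking something like this. The boolean names are the ones we settled on; the surrounding type definitions are an illustrative sketch, not our actual code.

// One payload per prompt. Each targeted boolean restates the exact
// question right before the model commits to an answer.
type EndCallPayload = {
  end_call: boolean;
  contextual_response: string;
};

type ForwardCallPayload = {
  forward_call: boolean;
  contextual_response: string;
};

type TopicGuardPayload = {
  topic_is_out_of_bounds: boolean;
  contextual_response: string;
};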
More Nuanced JSON Payloads
Imagine a prompt that looks at an incoming Zendesk ticket and determines the ticket's urgency, which team it should be routed to, and a recommended response to the customer.
enum Urgency {
  CRITICAL = "critical",
  HIGH = "high",
  MEDIUM = "medium",
  LOW = "low",
  NOT_URGENT = "none"
}

enum Team {
  FRONTEND = "frontend",
  DEVELOPER_EXPERIENCE = "developer_experience",
  INFRASTRUCTURE = "infrastructure"
}
// JSON Payload
{
  team: Team,
  urgency: Urgency,
  response: string
}
Assume that the system prompt contains instructions and examples for how to choose urgency and the right team.
Here, if you label something as Urgency.CRITICAL first, you are more likely to get an urgent-sounding response. If the response is generated before the urgency, its tone might lead to a more moderate urgency rating for the same conversation.
// Urgency first.
{
  team: Team.FRONTEND,
  urgency: Urgency.HIGH,
  response: "Thank you for bringing this to our attention. Our frontend team is looking into the issue and will keep you updated."
}
// Response first.
{
  team: Team.FRONTEND,
  response: "We've noted your concerns and will investigate. Thank you for your patience.",
  urgency: Urgency.MEDIUM
}
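If you'd rather not hand-maintain the ordering, a schema library can pin it down. Here's a sketch assuming zod (my assumption for illustration, not something this prompt depends on); since JavaScript objects keep insertion order, the declared shape keeps team and urgency ahead of response wherever the schema gets serialized into a prompt or response format.

import { z } from "zod";

// Classification fields come before the free-text reply, so the reply's
// tone is conditioned on team and urgency rather than the reverse.
const TriagePayload = z.object({
  team: z.enum(["frontend", "developer_experience", "infrastructure"]),
  urgency: z.enum(["critical", "high", "medium", "low", "none"]),
  response: z.string(),
});

type TriagePayload = z.infer<typeof TriagePayload>;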
I've seen this play out personally on the margins, though it's harder to estimate the impact as a percentage of accuracy. I need to do a deeper dive here.
My gut instinct tells me that the larger your JSON payload is, the more likely you are to have some degree of nuance-dependency between fields. The stronger that nuance-dependency, the larger the accuracy gains you'll get from a reordering.
Summary
If you have a good method for evals, it's worth testing these micro-optimizations. I've had good luck with o1 and Claude finding more specific naming conventions and better structures to experiment with.
My team at work has begun defaulting to this more specific method for naming fields and ordering them, and has seen real results. If you've seen something similar in your work, please let me know. I'd love to hear more about the use case. DM me on Twitter, or email me.