Output Token Order is a Hidden Variable in LLM Response Quality and Accuracy
Two lessons from production
Every output token an LLM generates depends on all preceding tokens. This includes previously generated output tokens.
Below, $x_t$ is the token the LLM is currently generating. Each preceding token, $x_1$ through $x_{t-1}$, can be an input token provided by the user, or an output token generated earlier in this run:

$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

As soon as an output token is generated, it becomes part of the context that determines the probability of the next token.
This implies that you can steer the LLM to better outputs simply by front-loading output tokens that will have an impact on later tokens. These front-loaded outputs anchor the LLM and help shrink the universe of possible subsequent output tokens.
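That anchoring effect falls directly out of the standard autoregressive factorization (a general fact about how these models decode, not anything specific to one model or vendor): the probability of a full output is the product of per-token conditionals, so any token placed early appears in the conditioning context of every token after it.

$$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$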
I've found two ways of leveraging this; I'm confident there are dozens more.
Below are short explanations of both, and my anecdotal evidence.
JSON Field Ordering
Let's say we're building a voice agent that speaks with users over the phone.
We have a prompt that evaluates the transcript and decides when we should end the call. If we are going to end the call, is_relevant is true, and then we'll reply with the generated response.
{
  is_relevant: boolean,
  response: string
}
Because the LLM generates the is_relevant field before the response field, the relevancy decision is baked into the response. That wouldn't be true if I reordered the fields:
{
  response: string,
  is_relevant: boolean
}
Here, the response would be generated first, and the relevancy judgment would be influenced by whatever response the model happened to write, rather than the other way around.
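One way to make the ordering concrete in code is to embed an example payload in the system prompt. This is a minimal sketch, not our production code; the type names and prompt text are hypothetical. It leans on the fact that JavaScript objects preserve string-key insertion order, so JSON.stringify emits is_relevant before response.

// Decision-first payload: the boolean is generated before the reply,
// so the reply is conditioned on the end-call decision.
type RelevancePayload = {
  is_relevant: boolean; // generated first, anchors everything below
  response: string;     // written with the decision already fixed
};

// Hypothetical example output to embed in the system prompt. Key order
// survives JSON.stringify, so the model sees is_relevant before response.
const exampleOutput: RelevancePayload = {
  is_relevant: true,
  response: "Thanks for calling! Have a great day.",
};

const systemPrompt = [
  "Decide whether the call should end, then write the reply.",
  "Respond with JSON shaped exactly like this example:",
  JSON.stringify(exampleOutput, null, 2),
].join("\n");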
JSON Field Names
In the same response format for our end-call prompt, we have an opportunity to eke out slightly more efficiency.
Instead of having generic field names, replace them with more targeted names.
{
  end_call: boolean,
  contextual_response: string
}
Because each field name is generated immediately before its value, a more targeted name sways the value toward being more targeted as well.
Anecdotal Outcomes
Simple JSON Payloads
The example prompt for ending a call is a simplified version of a prompt we use in production. We have similar prompts for forwarding calls, supervising behavior, deciding whether the conversation strays too far from allowed topics, and so on.
When we first started writing these prompts, all of them ran on is_relevant as a standard field. We'd give 10-15 demonstrations of relevancy and explain that is_relevant means "choosing to end the call is relevant," along with other instructions to make that clear.
Nothing we tried in our system prompt gave us as much of a gain in accuracy as renaming is_relevant to end_call, forward_call, and topic_is_out_of_bounds. That clarity, added right before the boolean was decided, easily added 5-10% accuracy on our prompts. The compounded effect of each individual prompt gaining 5-10% accuracy made the system as a whole more predictable and reliable.
There are several edge cases we discovered where both the end call and forward call prompts triggered at the same time, or where call forwarding would trigger when you'd expect end call to trigger instead. These edge cases fell off significantly with the more specific naming convention. Having a more targeted field name meaningfully shrunk the universe of what the LLM considered relevant.
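Concretely, the renamed payloads end up looking something like this. The boolean names are the ones we settled on; the surrounding type definitions are an illustrative sketch, not our actual code.

// One payload per prompt. Each targeted boolean restates the exact
// question right before the model commits to an answer.
type EndCallPayload = {
  end_call: boolean;
  contextual_response: string;
};

type ForwardCallPayload = {
  forward_call: boolean;
  contextual_response: string;
};

type TopicGuardPayload = {
  topic_is_out_of_bounds: boolean;
  contextual_response: string;
};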
More Nuanced JSON Payloads
Imagine a prompt that looks at an incoming Zendesk ticket and determines the ticket's urgency, which team it should be routed to, and a recommended response to the customer.
enum Urgency {
  CRITICAL = "critical",
  HIGH = "high",
  MEDIUM = "medium",
  LOW = "low",
  NOT_URGENT = "none"
}

enum Team {
  FRONTEND = "frontend",
  DEVELOPER_EXPERIENCE = "developer_experience",
  INFRASTRUCTURE = "infrastructure"
}
// JSON Payload
{
  team: Team,
  urgency: Urgency,
  response: string
}
Assume that the system prompt contains instructions and examples for how to choose urgency and the right team.
Here, if you label something as Urgency.CRITICAL first, you are more likely to get an urgent-sounding response. If the response is generated before the urgency, its tone might lead to a more moderate urgency rating for the same conversation.
// Urgency first.
{
  team: Team.FRONTEND,
  urgency: Urgency.HIGH,
  response: "Thank you for bringing this to our attention. Our frontend team is looking into the issue and will keep you updated."
}
// Response first.
{
  team: Team.FRONTEND,
  response: "We've noted your concerns and will investigate. Thank you for your patience.",
  urgency: Urgency.MEDIUM
}
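If you'd rather not hand-maintain the ordering, a schema library can pin it down. Here's a sketch assuming zod (my assumption for illustration, not something this prompt depends on); since JavaScript objects keep insertion order, the declared shape keeps team and urgency ahead of response wherever the schema gets serialized into a prompt or response format.

import { z } from "zod";

// Classification fields come before the free-text reply, so the reply's
// tone is conditioned on team and urgency rather than the reverse.
const TriagePayload = z.object({
  team: z.enum(["frontend", "developer_experience", "infrastructure"]),
  urgency: z.enum(["critical", "high", "medium", "low", "none"]),
  response: z.string(),
});

type TriagePayload = z.infer<typeof TriagePayload>;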
I've seen this play out personally on the margins, though it's harder to estimate the impact as a percentage of accuracy. I need to do a deeper dive here.
My gut instinct tells me that the larger your JSON payload is, the more likely you are to have some degree of nuance-dependency between fields. The stronger that nuance-dependency, the larger the accuracy gains you'll get from a reordering.
Summary
If you have a good method for evals, it's worth testing these micro-optimizations. I've had good luck with o1 and Claude finding more specific naming conventions and better structures to experiment with.
My team at work has begun defaulting to this more specific method for naming fields and ordering them, and has seen real results. If you've seen something similar in your work, please let me know. I'd love to hear more about the use case. DM me on Twitter, or email me.