What Is a Token in LLMs?

When I first started building with LLM APIs, I kept thinking in words. That mental model caused more production mistakes than I expected.

LLMs do not operate on words the way humans do. They operate on tokens, and once you start designing around tokens instead of words, many confusing behaviors in cost, latency, and quality become predictable (OpenAI Help Center, 2026a).

A Practical Definition

A token is the unit of text an LLM consumes and produces. It is neither a word nor a character; in practice it is often a word fragment, a short whole word, or a piece of punctuation. This distinction matters because every hard limit in an LLM workflow, including context windows, billing, and response length, is ultimately token-based.

What Happens When You Send a Prompt

At runtime, your input string is tokenized into pieces, those pieces are mapped to numeric IDs, and the model predicts the next ID repeatedly. The output text you see is a decoded sequence of predicted token IDs. That tokenize-to-IDs-to-next-token loop is the core generation mechanism (OpenAI Help Center, 2026a; OpenAI, 2026a).
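To make that loop concrete, here is a minimal sketch using the tiktoken library (OpenAI, 2026d). The cl100k_base encoding is one of OpenAI's published encodings; the exact IDs and pieces you see depend on which encoding you load.

```python
# Minimal sketch of the tokenize -> IDs -> decode loop using tiktoken.
# cl100k_base is one published encoding; other models use other encodings.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokens are not words."
ids = enc.encode(text)                   # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in ids]  # each ID maps back to a text piece

print(ids)                      # a short list of integers
print(pieces)                   # the pieces the model actually sees
print(enc.decode(ids) == text)  # decoding the IDs round-trips to the input
```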

This also explains why token counts differ across models. Different tokenizers segment the same sentence differently, so the same prompt can have different budget and cost behavior depending on model choice (OpenAI Cookbook, 2026).
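You can observe this directly by counting the same sentence under several encodings. The encoding names below are ones tiktoken ships today; the counts will typically differ between them.

```python
# Same sentence, different encodings, different token counts.
import tiktoken

sentence = "Tokenizers segment the same sentence differently."
for name in ("r50k_base", "p50k_base", "cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(sentence))} tokens")
```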

Why "Tokens = Words" Fails in Practice

I still hear rough conversions like "100 tokens is about 100 words." Sometimes that estimate is close enough for casual discussion, but it is unreliable for engineering decisions. OpenAI guidance treats word conversion as an approximation, not an accounting method; common English may average around one token per four characters, but real payloads vary significantly (OpenAI Help Center, 2026a).

In production traffic, structure matters more than prose smoothness. JSON payloads, stack traces, source code, long identifiers, and punctuation-heavy content often consume far more tokens than teams expect. That is why word-count planning tends to underestimate risk in the exact paths that are most expensive.
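A rough way to see this is to compare a prose sentence against a structured payload of similar meaning. The strings below are illustrative, not real traffic, and the comparison assumes the cl100k_base encoding.

```python
# Illustrative comparison: structured JSON vs. prose of similar meaning.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prose = "The user updated their shipping address and confirmed the order."
payload = json.dumps({
    "event": "order.updated",
    "shipping_address": {"line1": "123 Example St", "city": "Springfield"},
    "confirmed": True,
})

for label, text in (("prose", prose), ("json", payload)):
    n = len(enc.encode(text))
    print(f"{label}: {len(text)} chars, {n} tokens, {len(text) / n:.1f} chars/token")
```

Quotes, braces, and long identifiers tend to push the chars-per-token ratio down, which is exactly where the one-token-per-four-characters heuristic breaks.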

Why Tokens Matter in Real Systems

The first reason is hard limits. Every request has a maximum context window, and input plus output tokens share that same budget (OpenAI Help Center, 2026b). If the input grows too large, the request fails, context gets trimmed, or output gets clipped.
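A simple guard makes the shared budget explicit in code. The context window and output reserve below are assumptions for illustration, not limits of any particular model.

```python
# Sketch: input and output tokens share one context window.
import tiktoken

CONTEXT_WINDOW = 8_000   # assumed model limit, illustrative only
RESERVED_OUTPUT = 1_000  # tokens kept free for the response

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(prompt: str) -> bool:
    """True if the prompt still leaves the reserved output budget intact."""
    return len(enc.encode(prompt)) + RESERVED_OUTPUT <= CONTEXT_WINDOW
```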

The second reason is cost. Usage accounting is token-based, including distinctions such as input, output, and cached tokens (OpenAI Help Center, 2026a; OpenAI, 2026b). Teams often focus on choosing a model before enforcing prompt discipline, but in production workloads prompt structure usually drives spend as much as model selection does.
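Cost estimation then follows directly from token counts. The per-token prices below are placeholders, not published rates; substitute the current figures from the provider's pricing page (OpenAI, 2026b).

```python
# Sketch: token-based cost accounting with placeholder prices.
INPUT_PRICE_PER_1K = 0.0005   # hypothetical USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.0015  # hypothetical USD per 1K output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (input_tokens / 1_000 * INPUT_PRICE_PER_1K
            + output_tokens / 1_000 * OUTPUT_PRICE_PER_1K)

print(f"${estimate_cost(6_000, 800):.4f} per request")
```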

The third reason is latency. Larger token volumes generally increase processing and generation time, so token growth in hot paths often appears as performance regression before teams explicitly connect it to token budgets (OpenAI, 2026c; OpenAI Help Center, 2026c).

The fourth reason is quality under context pressure. More context is not automatically better context. If low-value text dominates the window, critical constraints and instructions become diluted, and response quality becomes less reliable.

How I Budget Tokens in Production

I treat the context window as a planned budget, not a passive limit. In practice, I allocate explicit space for system instructions, task state, retrieval context, tool outputs, and final response. Reserving output space early is especially important, because without it long-form responses are often truncated (OpenAI Help Center, 2026d).
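One way to make that allocation explicit is a small budget object that gets validated whenever a prompt template changes. The numbers below are illustrative defaults, not recommendations.

```python
# Sketch: the context window as a planned budget with named allocations.
from dataclasses import dataclass

@dataclass
class TokenBudget:
    context_window: int = 8_000  # assumed model limit
    system: int = 500            # system instructions
    state: int = 1_000           # summarized task/chat state
    retrieval: int = 3_000       # retrieved passages
    tools: int = 1_500           # tool outputs
    output: int = 2_000          # reserved for the model's response

    def validate(self) -> None:
        used = self.system + self.state + self.retrieval + self.tools + self.output
        if used > self.context_window:
            raise ValueError(f"over budget by {used - self.context_window} tokens")

TokenBudget().validate()  # raises if the allocations exceed the window
```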

This shows up clearly in long-context assistants. If you append full chat history and raw source documents without compression, you consume most of the window before generation begins. The fix is straightforward: summarize old turns into structured state, retrieve only relevant passages, and keep tool output concise before forwarding it to the model.
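The compression step can be as simple as keeping recent turns verbatim and folding everything older into one summary entry. The summarize() helper below is a placeholder for whatever summarization you use (an LLM call, a template), not a real library function.

```python
# Sketch: fold old turns into one summary entry, keep recent turns verbatim.
def summarize(turns: list[str]) -> str:
    # Placeholder: replace with a real summarization step (e.g. an LLM call).
    return "Earlier conversation summary: " + " | ".join(t[:40] for t in turns)

def compress_history(turns: list[str], keep_recent: int = 4) -> list[str]:
    """Return a compact history: one summary entry plus the latest turns."""
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(old)] + recent
```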

A Concrete Scenario

Assume an 8k-token model call where you want a detailed answer. If most of the budget is spent on historical chat and unfiltered retrieval dumps, the model has too little room left to produce an answer with depth. The result is often a clipped or generic response that looks like "model quality" failure but is really budget planning failure.
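The arithmetic is easy to see with made-up but plausible numbers:

```python
# Illustrative numbers only: how an 8k window gets consumed before generation.
context_window = 8_000
chat_history = 4_500      # full transcript appended verbatim
retrieval_dump = 2_800    # unfiltered retrieved documents
system_and_task = 400     # instructions plus the current question

room_for_output = context_window - (chat_history + retrieval_dump + system_and_task)
print(room_for_output)    # 300 tokens left for a "detailed" answer
```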

When the same workflow reserves output tokens, compresses history, and narrows retrieval to high-relevance spans, quality, latency, and cost usually improve together. Tradeoffs remain, but they become explicit and manageable.

Common Build-Time Mistakes

The most common mistake is treating token limits as a late QA check rather than an architecture decision. Closely related mistakes include passing full transcripts where a compact state summary would work, sending duplicate context from multiple tools, and forgetting that tool output consumes the same window as user input.

These are process issues more than model issues, and they are fixable once token budgets are made visible in design and review.

A Repeatable Working Habit

The habit that has helped me most is simple: keep prompts compact and explicit, summarize before appending new context, retrieve less but retrieve better, and measure token counts on critical paths whenever prompt templates change. That discipline makes behavior more predictable across quality, latency, and cost.
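A lightweight way to enforce the last part is a token-count regression test on critical templates. The render_prompt() function and the ceiling below are assumptions for illustration, not part of any real codebase.

```python
# Sketch: a token-count regression test for a critical prompt template.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
PROMPT_TOKEN_CEILING = 2_500  # assumed budget for this template

def render_prompt(question: str) -> str:
    # Placeholder for the real prompt template.
    return f"You are a support assistant.\n\nQuestion: {question}\nAnswer concisely."

def test_prompt_template_stays_within_budget():
    prompt = render_prompt("Why was my invoice charged twice?")
    assert len(enc.encode(prompt)) <= PROMPT_TOKEN_CEILING
```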

Video I Recommend

If you want to understand tokenization from first principles, this remains one of the best practical walkthroughs:

Let's build the GPT Tokenizer (Andrej Karpathy)

Final Take

When someone asks me what a token is, I answer in operational terms: it is the basic unit an LLM reads and writes, and token budgeting is the engineering practice that keeps LLM applications stable.

Once I shifted from word-level thinking to token-level design, most of the confusing behavior in my LLM features became easier to explain, debug, and improve.

References

OpenAI Help Center (2026a) What are tokens and how to count them? Available at: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them (Accessed: 6 February 2026).

OpenAI Help Center (2026b) What is the difference between prompt tokens and completion tokens? Available at: https://help.openai.com/en/articles/7127987-what-is-the-difference-between-prompt-tokens-and-completion-tokens (Accessed: 6 February 2026).

OpenAI Help Center (2026c) Optimizing latency with OpenAI API models. Available at: https://help.openai.com/en/articles/6901266-guidance-on-improving-latencies (Accessed: 6 February 2026).

OpenAI Help Center (2026d) Controlling the length of OpenAI model responses. Available at: https://help.openai.com/en/articles/5072518-controlling-the-length-of-openai-model-responses (Accessed: 6 February 2026).

OpenAI (2026a) Tokenizer tool. Available at: https://platform.openai.com/tokenizer (Accessed: 6 February 2026).

OpenAI (2026b) API Pricing. Available at: https://openai.com/api/pricing/ (Accessed: 6 February 2026).

OpenAI (2026c) Latency optimization guide. Available at: https://platform.openai.com/docs/guides/latency-optimization (Accessed: 6 February 2026).

OpenAI Cookbook (2026) How to count tokens with tiktoken. Available at: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken (Accessed: 6 February 2026).

OpenAI (2026d) tiktoken. Available at: https://github.com/openai/tiktoken (Accessed: 6 February 2026).

Sennrich, R., Haddow, B. and Birch, A. (2015) Neural Machine Translation of Rare Words with Subword Units. Available at: https://arxiv.org/abs/1508.07909 (Accessed: 6 February 2026).