Perplexity and Burstiness: The Metrics Behind AI Detection

Updated April 2026 · 4 min read

Every AI detector reduces text to a handful of numerical metrics. Two of them do most of the work: perplexity and burstiness. If you understand what these measure and why AI text scores the way it does on each, you understand 80% of how detection works — and you know what to target if you're trying to make text sound more human.

Perplexity: how predictable is this text?

Perplexity is a concept from natural language processing. It measures how surprised a language model would be by a given text. Low perplexity means the next word was easy to predict; high perplexity means the next word was unexpected.

Human writing has higher perplexity than you might expect. We reach for unusual words, unexpected phrasings, ideas that don't follow neatly from the previous sentence. AI text is generated by favoring predictability: each word is drawn from a probability distribution that concentrates heavily on the likeliest continuations. The result: AI text has systematically lower perplexity than human writing about the same topic.

The catch is that some human writing also has low perplexity. Formal academic prose, legal writing, technical documentation — genres where the vocabulary is constrained and the phrasing conventionalized — can score as low as AI on this single metric. That's why no detector uses perplexity alone.
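Concretely, perplexity is the exponential of the average negative log-probability a model assigns to each token. Here's a minimal sketch of that formula; the per-token probabilities below are invented for illustration, not output from a real model:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability.

    token_probs: the probability a language model assigned to each
    token, in order. Lower perplexity = more predictable text.
    """
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Illustrative numbers, not real model outputs:
predictable = [0.9, 0.8, 0.85, 0.9]   # model saw each word coming
surprising  = [0.2, 0.05, 0.3, 0.1]   # model was frequently wrong

print(perplexity(predictable))  # low: reads as "AI-like"
print(perplexity(surprising))   # several times higher: "human-like"
```

The key property to notice: a text where every word was highly probable scores close to 1, and the score climbs as the model's predictions keep missing.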

Burstiness: how much does it vary?

Burstiness measures variation across the text. Specifically, it looks at sentence length: how much the number of words per sentence fluctuates from sentence to sentence across the document.

Human writing is bursty. One sentence is four words, the next is thirty-five, the third is sixteen. This variation reflects how humans actually think — we compress some ideas into fragments and expand others across multiple clauses. AI writing is metronomic. Sentence after sentence lands in the same narrow range — typically 18 to 25 words. The variation is small and the rhythm is flat.

Burstiness is harder to fake than perplexity. You can swap in unusual words to raise perplexity without changing the structure of your sentences. You can't raise burstiness without actually restructuring — cutting long sentences, merging short ones, adding fragments.

Why AI scores low on both

Both metrics are low in AI text for the same underlying reason: language models are trained to generate the most likely continuation of a prompt. Each word is drawn from the likeliest candidates; each sentence falls into the structures the model has seen most often. The model has no internal rhythm, no point it wants to make, no specific memory to reach for. It produces text that sits in the statistical middle of its training distribution.

Humans don't do this. We write to say something specific, which takes us to unusual words and asymmetric structures. The mathematics of language generation produces smooth, uniform text; the mathematics of human intention produces bumpy, varied text. Detectors measure the bumpiness.

How detectors combine the metrics

No serious detector uses just one metric. Tools like GPTZero, ZeroGPT, and Sapling combine perplexity, burstiness, vocabulary diversity, and several smaller signals into a single score. Some weight perplexity more heavily; others lean on burstiness. This is why different tools disagree on borderline texts: they're summing the same ingredients in different proportions.

When two detectors disagree, the disagreement usually comes down to this weighting. If your text has low perplexity but high burstiness (structured but varied), some detectors will flag it and some won't. That ambiguity is the useful signal: the text is genuinely borderline, not definitively one thing or the other.
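To see how weighting alone can flip a verdict, here's a toy combiner. The normalization constants and weights below are invented for illustration; real detectors are trained classifiers, not hand-tuned formulas like this:

```python
def ai_score(perplexity, burstiness, weights):
    """Toy detector score in [0, 1]; higher = more 'AI-like'.

    The normalization constants (50, 0.8) and the weights are
    invented for illustration only.
    """
    # Map each metric to a 0-1 "looks AI-generated" signal:
    # low perplexity and low burstiness both push the signal up.
    ppl_signal = max(0.0, min(1.0, 1 - perplexity / 50))
    burst_signal = max(0.0, min(1.0, 1 - burstiness / 0.8))
    w_ppl, w_burst = weights
    return w_ppl * ppl_signal + w_burst * burst_signal

# One borderline text: low perplexity (structured) but high burstiness (varied).
metrics = dict(perplexity=12.0, burstiness=0.7)

print(ai_score(**metrics, weights=(0.8, 0.2)))  # perplexity-heavy detector: ~0.63
print(ai_score(**metrics, weights=(0.2, 0.8)))  # burstiness-heavy detector: ~0.25
```

Same text, same ingredients, opposite verdicts once a flag threshold sits anywhere between the two scores.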

What this means for writing

If you want your writing to read as human — to a human or to a detector — target these metrics directly. Vary your sentence lengths deliberately. Use specific and unusual words rather than the "safe" options. Cut formal connectors. Add specific details that couldn't come from a generic language model.

These are all things good writers do anyway. Detectors are essentially measuring how well-varied and specific your writing is. The overlap between "writing that reads as human" and "writing that's engaging" is nearly total. Improving your detector score and improving your writing are mostly the same project.

Why RealText exposes the metrics directly

RealText shows you perplexity, burstiness, and TTR (type-token ratio, a measure of vocabulary diversity) as separate numbers instead of hiding them behind a single score. This lets you see which metric is pulling your score down and target it specifically. A text with low burstiness and normal perplexity needs rhythm work; a text with low perplexity and normal burstiness needs vocabulary work. Without that breakdown, every edit is a guess.
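TTR itself is the simplest of the three to compute: unique words divided by total words. A quick sketch (note that raw TTR falls as texts get longer, so it's only fair to compare texts of similar length):

```python
import re

def ttr(text):
    """Type-token ratio: unique words / total words.

    A crude vocabulary-diversity measure. Raw TTR shrinks with
    document length, so compare only similar-length texts.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

print(ttr("the cat sat on the mat"))  # 5 unique / 6 total = 0.833
```

Repetitive, "safe" word choices drag this number down; specific and varied vocabulary pushes it up.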

See the metrics behind your score.

Try RealText Free →