Are all LLMs the same? I set out to measure their similarity.

AI-generated text often has a certain je ne sais quoi of monotony and blandness, whatever model is used. An attempt to put a number on this phenomenon shows once more that measuring meaning is close to impossible.

Story

14 November 2025

#llm

Nicolas Kayser-Bril
Head of Journalism

Similarity testing. I'm certainly not the only one who feels that generative AI, and large language models (LLMs) in particular, always output the same thing. Several users have shared their dismay that LLMs always tell the same predictable jokes, for instance. Just how similar are these models, really? Finding out is much harder than it sounds.

The human feeling of similarity is very far removed from a machine-readable concept of equivalence. The same meaning can be expressed in countless ways. Conversely, identical phrases can take on vastly different meanings depending on the context. Emily Wenger, a professor at Duke University who researches this area, told me that we actually expect models to be similar in many respects, such as proper use of language or adherence to facts. "We want them to be original in ways that humans find helpful," she added.

An experiment. Undeterred, I devised a small experiment. I asked three LLMs to produce a list of 200 dates in world history, making it clear that I wasn't looking for an academic compendium - a task that can be approached in any number of ways. As historian Michel-Rolph Trouillot wrote, "history is both what happened and what is said to have happened." In other words, any list of 200 events is effectively its own version of History.

Working with dates has another advantage. Textual similarity that appears clear to humans can be worlds apart for a computer. For instance, "First successful atomic bomb used in warfare" and "Enola Gay takes off for Hiroshima" are very different texts but refer to the same event. By relying on the dates, I could easily reconcile the output of the LLMs.

Results. I collected approximately 300 batches of dates from OpenAI, xAI and Google Gemini, totaling over 70,000 events. In all but one case, the programs were unable to count to 200, returning anywhere between 49 and 850 events. Some results reflected poorly on the overall capabilities of LLMs, such as xAI's Grok outputting "Nottingham forest fire? Wait, Notre-Dame fire" for an event on 15 April 2019, or OpenAI's model announcing that "Mars colonization program expands to multiple settlements" on 1 December 2045.

Still, after some data cleaning (read the details on GitHub), I managed to build comparable data sets. I then compared each list of dates against every other list in the sample. Unsurprisingly, two lists output by the same model were generally more similar than lists generated by different models. However, in some cases, the difference was slight. Two lists produced by OpenAI's ChatGPT had, on average, 18 events in common out of 100. A list by ChatGPT and a list by Gemini shared 14 common events. xAI's Grok proved to be slightly more divergent from the other two.
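The pairwise comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual pipeline: the list names and dates below are made up, and it assumes each cleaned list has been reduced to a set of dates (which is what makes reconciling differently worded events possible).

```python
from itertools import combinations

def overlap_per_100(list_a, list_b):
    """Number of shared dates between two lists, scaled to 100 events.

    Normalizes by the smaller list so scores stay comparable even when
    the models return very different list lengths.
    """
    dates_a, dates_b = set(list_a), set(list_b)
    common = len(dates_a & dates_b)
    return 100 * common / min(len(dates_a), len(dates_b))

# Hypothetical toy data: each list is a set of ISO dates one model returned.
lists = {
    "chatgpt_run1": {"1945-08-06", "0476-09-04", "1789-07-14"},
    "chatgpt_run2": {"1945-08-06", "1969-07-20", "1789-07-14"},
    "gemini_run1":  {"1945-08-06", "1066-10-14", "1431-01-01"},
}

# Compare every list against every other list in the sample.
for (name_a, a), (name_b, b) in combinations(lists.items(), 2):
    print(f"{name_a} vs {name_b}: {overlap_per_100(a, b):.1f} events in common per 100")
```

Matching on dates rather than on event descriptions sidesteps the semantic-similarity problem the article raises: "First successful atomic bomb used in warfare" and "Enola Gay takes off for Hiroshima" collapse onto the same key.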

Shades of gray. This experiment -- which is by no means perfect -- can put a number on LLM similarity, but it falls short of measuring what we, humans, interpret as such. Looking through the lists of events, I was most struck by the omissions. For instance, the fall of Rome in 476 was mentioned hundreds of times, yet the fall of Angkor in 1431 was not mentioned at all (to repeat Trouillot's point, the historical merit of these dates is irrelevant). The destruction of Hiroshima appeared in almost all data sets, but the destruction of Benin City (in 1897) was missing entirely. Even conservative visions of history were lacking. Births and deaths of "great men" are few and far between, delivering a blow to the 19th century's "great man theory." Religion is almost absent from the data, as is art history. Abortion access was mentioned twice and contraception three times, out of a total of 70,000 events.

As could be expected, the data reflects the worldview of Californian programmers. Beyond the similarity of the dates themselves, the LLMs' outputs reflect a technical, US-American, white, and masculine vision of history. In these conditions, measuring similarity feels like building a colorimeter calibrated solely on shades of gray.

An alternative approach would be to compare LLMs to humans. Emily Wenger of Duke University did just that in an article published in January. She demonstrated that responses from humans were much more diverse than those from LLMs. Further research from her lab showed that LLM similarity increases as their training data sets overlap. Given the ongoing race to use all available human-made artifacts as training data, we should expect LLMs to feel ever more similar.


This is an excerpt from the Automated Society newsletter, a biweekly round-up of news in automated decision-making in Europe. Subscribe here.