Large language models as attributes of statehood

As governments champion AI, the development and control of large language models are becoming matters of state. European countries are investing heavily in language resources, with geopolitics casting a long shadow over the endeavor.

Story

23 January 2026


Prunksaal der Österreichischen Nationalbibliothek
Photo by Iveri MODEBADZE on Unsplash
Dr. Nicolas Kayser-Bril
Head of Journalism

Statehood. In the late 18th century, Johann Gottfried von Herder famously wrote that “every nation is one people, having ... its own language.” Today, that view looks outdated and the equation “one language = one country” does not hold true (to be fair, Herder wrote before Belgium came into being). Yet state-building was long a project in language-building, if only because the state apparatus required a coherent means of communicating its decisions to the population it claimed as its own. In many cases, the national language was codified only after the state was created, as speakers of Montenegrin can attest.

What many speakers of French or German take for granted, such as the infrastructure of dictionaries, was built from scratch in much of Central Europe. The Estonian etymological dictionary, for instance, was only completed in 2013; its Slovak counterpart followed in 2016.

LLMs. Such works are known as “language resources” in the jargon of artificial intelligence developers and are crucial to training the large language models (LLMs) that underpin most AI applications. Other resources include vast corpora of texts, ranging from books to web pages.

The graph below shows that the sheer volume of resources available for English dwarfs that of most other languages. As a result, LLMs tend to perform less well overall for “low-resource languages”.

External content from datawrapper.com


Gearing up. Many governments now aspire for their countries to have a high-performance LLM for their national language(s). Last week, the head of Serbia’s e-government services announced a new national LLM as an instrument of “state sovereignty.” Investments in language resources long predate the AI craze, but the scale of ambition has shifted. The Slovak national corpus, for instance, is a long-running project to digitize texts in Slovak, launched in 2002. The government has allocated about €30,000 per year to it ever since. Contemporary efforts in other small EU states are far more lavish. Estonia is investing close to €1m a year in language resources, and Lithuania almost €10m.

However, as a proportion of the state budget, the largest investor in language resources is not a small country but a former imperial power. In 2022, the Spanish government allocated €1bn over five years to the “Strategic Plan for the Promotion of Spanish Languages,” following a €90m project launched in 2015. The initiative is as much about geopolitics and business as it is about linguistics. The Spanish government’s build-up of language resources aims explicitly at dominating AI services in Latin America, and its focus on Basque, Galician, Valencian and Catalan may also serve as a way to one-up independence-minded regional governments.

Unwanted attention. Governments usually intervene in linguistics for their own ends, including population control. In the mid-1930s, for instance, Moscow imposed the requirement that every language in the Soviet Union be written in Cyrillic. Non-Russian-looking alphabets were seen as seditious. Today, some linguistic minorities sense a similar threat from language technologies. The development of an LLM fluent in Romani (admittedly not funded by a government) raises fears that it could be used for eavesdropping and to step up the policing of Roma people.

In some cases, LLMs could even become liabilities for national security. AI-generated disinformation campaigns and the automated analysis of intercepted communications are now commonplace in warfare. Greenland enthusiasts who for years published reams of gibberish in the Greenlandic Wikipedia, using poor translation tools, may have unwittingly bolstered the island's security by sabotaging the few “resources” available in this language.

I thank Ľubor Králik and Alexander Maxwell, as well as my colleagues Eva Lejla Podgoršek and Naiara Bellio, for their help with this article.


This is an excerpt from the Automated Society newsletter, a bi-weekly round-up of news on automated decision-making in Europe.