How not to: We failed at analyzing public discourse on AI with ChatGPT

We wanted to collect data on how newspapers wrote about AI. Swayed by the sirens of OpenAI, we thought that using a Large Language Model was the best way to do such a massive text analysis. Alas, we failed. But LLMs are probably not to blame.


6 February 2024


Photo by Flipboard on Unsplash

In recent years, many European media outlets have peddled the tale of Big Tech creating an apocalyptic Artificial Intelligence (AI) that would overtake and even end humanity for good. Lacking any tangible evidence to back up these claims, they simply repeat the discourse of Silicon Valley tech CEOs.

The main plot in this story is the advent of Artificial General Intelligence (AGI): the moment when computerized systems become sentient and outperform humans at everything we do. Technology firms are nowhere near creating such a model, and the very idea that machines and animals are comparable is disputed. But that does not stop journalists from writing about AGI.

Examples of this tendency abound. The ChatGPT launch in 2022 was probably a turning point in the way AI was depicted to the public. This was also the year when a Google engineer claimed to have interacted with a Large Language Model (LLM) that had allegedly gained consciousness (it hadn’t). Then the CEO of a car maker (and frequent liar) said that his cars had a mind, and many news outlets obligingly made it headline news. More recently, OpenAI, the company behind ChatGPT, became mired in boardroom intrigues. Even regional news outlets ran stories on the affair, in the style of a soap opera, adding subplots of worrisome new AGI-capable models.

Even though not all news outlets partake in this AI hype, there is a problem with the coverage of automated systems in European media. How big is it? Are critical voices given any room? Do newspapers cover cases of automated systems gone wrong? Are some newspapers better than others? Has coverage changed a lot since ChatGPT entered the scene? These are the questions we wanted to find answers to.

This research complemented the work done by our Fellows in algorithmic accountability reporting, who investigated cases of individuals fighting back against automated systems. We wanted to know why their fights failed to reach prime-time news coverage.

Messy data

Several scholars have already researched the issue. Maxime Crépel and Dominique Cardon analyzed the coverage of AI in English-language media from 2015 to 2019; Anne Bellon and Julia Velkovska did the same for France over the period 2000-2019.

We wanted to go further on two counts. First, by providing a European perspective, because an algorithm-related scandal in one EU country rarely makes headlines in neighboring countries (whereas stories from Britain or the United States are well-covered on the continent). Second, we wanted to include coverage until late 2023, in order to assess the impact of ChatGPT on the public discourse.

We built a dataset of articles containing the words “artificial intelligence” or “algorithm” over the period 2013-2023. To do so, we paid (dearly) for access to LexisNexis, a database of newspaper articles. And therein lies our biggest mistake. The service was very slow to use, eating up hours of work just to download the corpus. The data we obtained was of very poor quality. The files were in docx format, with metadata mixed in with the actual article body. In some cases, LexisNexis' scrapers were faulty and the articles we obtained were unrelated to our search. There were many duplicates (the same article appearing more than once) that we had to weed out.
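To illustrate the kind of clean-up this required: once article bodies are extracted as plain text, exact and near-exact duplicates can be filtered by hashing a normalized version of each body. This is a minimal sketch, not the project's actual pipeline; all names are our own.

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase, so that trivial formatting
    # differences do not hide duplicates.
    return " ".join(text.lower().split())

def deduplicate(articles: list[str]) -> list[str]:
    # Keep only the first occurrence of each normalized article body.
    seen = set()
    unique = []
    for body in articles:
        digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique
```

Hashing the normalized text keeps memory use low on a corpus of tens of thousands of articles, since only digests are stored rather than full bodies.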

Training GPT to our will

Undeterred, we struggled on. We chose nine European media outlets that published in English, Spanish, and French and downloaded all their articles from 2013 to 2023 – over 30,000 stories in total.

With the help of Boris Reuderink, an expert in statistics, we set out to have the Large Language Model ChatGPT (version 3.5) classify our data. This didn’t seem too much to ask. We aimed to automatically sort articles into two groups: those that discussed the proven capacities of AI and the associated risks on one hand, and those that speculated about the futuristic risks of AGI on the other.
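A classification setup along these lines can be sketched as follows. The label names and prompt wording here are illustrative, not the exact ones we used; the point is that constraining the model to answer with one label from a fixed list makes the reply parseable.

```python
# Illustrative labels, not the project's exact wording.
LABELS = ["proven capacities and risks", "futuristic AGI speculation"]

def build_prompt(article: str, labels: list[str]) -> str:
    # Ask the model to answer with exactly one label, so the reply
    # can be checked mechanically against the label list.
    options = "; ".join(labels)
    return (
        "Classify the following newspaper article about AI "
        f"into exactly one of these categories: {options}.\n"
        "Answer with the category name only.\n\n"
        f"Article:\n{article}"
    )

def parse_label(reply: str, labels: list[str]):
    # Accept the model's reply only if it matches a known label;
    # anything else is treated as inconclusive (None).
    cleaned = reply.strip().lower()
    for label in labels:
        if label.lower() == cleaned:
            return label
    return None
```

The prompt string would then be sent to the model via the API; replies that do not match any label can be counted as inconclusive rather than silently accepted.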

To help train the model, we randomly selected 150 articles and manually labeled them.

Once we started, we realized that the two categories did not cover the full range of angles the media used to discuss AI. We added new categories that seemed relevant to our research questions: articles about the AI industry, about the legislative framework around AI, about the impact of AI on work, and those that anthropomorphized automated systems. With these labels, we hoped to find out, for instance, whether endowing AI systems with human-like qualities was correlated with a more uncritical take on the industry.

It was fun while it lasted

Although we took care to define our categories as unambiguously as possible, manual labeling proved difficult. Some articles combined contradictory approaches to AI threats: for instance, they raised a pessimistic outcome for humankind due to the evolution of AI before explaining that it was unlikely to happen. Still, we refined our definitions and ironed out the discrepancies in our labeling.
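One common way to quantify such labeling discrepancies – described here as an illustration, not as the method we actually applied – is Cohen's kappa, which measures agreement between two annotators corrected for the agreement expected by chance:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    # Observed agreement: fraction of items both annotators labeled identically.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # estimated from each annotator's label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Kappa of 1.0 means perfect agreement; 0.0 means chance-level agreement.
    return (observed - expected) / (1 - expected)
```

Values well below 1.0 on a double-labeled sample are a warning that the category definitions themselves are ambiguous, before any model enters the picture.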

But even as we went ahead, ChatGPT kept producing inconclusive labels. From a statistical perspective, given the training data at hand, OpenAI’s LLM was performing correctly. But from our perspective as journalists, the accuracy was still poor. More importantly, we could not identify why some labels were applied (we instructed ChatGPT to highlight the most salient sentences in an article, but that did not help much).
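Judging model output against hand-labeled articles takes only a few lines of plain Python. This is a generic sketch of such an evaluation, not our actual code:

```python
from collections import Counter

def accuracy(gold: list[str], predicted: list[str]) -> float:
    # Fraction of articles where the model agrees with the manual label.
    assert len(gold) == len(predicted)
    hits = sum(g == p for g, p in zip(gold, predicted))
    return hits / len(gold)

def confusion(gold: list[str], predicted: list[str]) -> Counter:
    # Count (manual label, model label) pairs, to see which
    # categories the model confuses most often.
    return Counter(zip(gold, predicted))
```

The confusion counts are more informative than the single accuracy number: they show whether errors cluster on one ambiguous category or are spread across all of them.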

Adding more categories meant that the LLM was making more mistakes on the original two. Leaving the model to classify the articles without supervision was not an option, and we were still not sure whether our label definitions were clear enough.

On top of these methodological issues, the ChatGPT API was very slow and stalled frequently, making it impossible to test any new approach on a large batch of articles.
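A standard workaround for a slow, frequently stalling API is to retry failed calls with exponential backoff. A minimal sketch follows; the `request` callable is a stand-in for whatever API call is being made:

```python
import time

def call_with_retries(request, max_attempts=5, base_delay=1.0):
    # Retry a flaky call, waiting base_delay * 2**attempt seconds
    # between attempts; re-raise if every attempt fails.
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Backoff papers over transient stalls, but it cannot fix throughput: when each call takes seconds, a 30,000-article corpus still means hours per experiment.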

In the end, the results were hardly usable. The automatically generated labels were fairly accurate, but nowhere near accurate enough for us to draw conclusions.

Lessons learnt

The problem was not LLMs as such. They classified the articles we provided based on the training data we labeled by hand. However, as with any Machine Learning system, this approach requires manually labeled data. Investing the time this demands is impractical at the start of an investigation, when journalists hope to get insights from the data quickly.

Even if ChatGPT did not fail catastrophically, as it sometimes does, it is unclear whether journalists can use it for data analysis on large corpora. Perhaps sticking to word counting and co-occurrence matrices would have been enough. Or perhaps we should have split the articles into shorter chunks of a few sentences, which would have been easier to interpret for both humans and machines.
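Word-pair co-occurrence counting, for comparison, needs no model at all. A minimal sketch of the idea, counting how often two words appear in the same document:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(documents: list[str]) -> Counter:
    # For each document, count every unordered pair of distinct words
    # that appears in it; sorting makes pairs order-independent.
    pairs = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs
```

Such counts are crude next to an LLM, but they are fast, deterministic, and fully explainable – every number can be traced back to specific documents.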

In any case, we hope that our experience will help others refine their methodologies and better select their tools.

Naiara Bellio (she/her)


Naiara Bellio covers privacy, automated decision-making systems, and digital rights. Before joining AlgorithmWatch, she coordinated the technology section of the foundation, addressing disinformation related to people's digital lives and leading international research on surveillance and data protection. She also worked for Agencia EFE in Madrid and Argentina. She collaborated with organizations such as Fair Trials and AlgoRace in researching the use of algorithmic systems by administrations.