How researchers are upping their game to audit recommender systems

Recommender systems, such as a social network’s news feed or a streaming service’s recommendations, are notoriously difficult to audit. Despite high hurdles, practitioners from journalism and academia are pushing forward.

Nicolas Kayser-Bril

In July 2021, the Wall Street Journal published an investigation promising to reveal “how TikTok's algorithm figures out your deepest desires”. The journalists built hundreds of robots that “watched” videos on TikTok. The bots were programmed to spend more time on certain videos (about thinness, for instance). Very quickly, the Journal found, TikTok gave the bots much more of the content it thought they were interested in. Users were sent deep into “rabbit holes” of similar content where they could remain stuck. In one case, a robot programmed to focus on sadness and depression was served “a deluge of depressing content” after less than 30 minutes on the app.

The investigation was – rightly – much lauded. But how much does it reveal about TikTok and its impact on society? In December 2021, Finnish researchers replicated the Journal’s experiment but did not find a “rabbit hole” effect. At the same time, Der Spiegel revealed internal TikTok documents showing that the “For You” feed (the main newsfeed on TikTok) actually withheld some content considered interesting for users. By frustrating users with uninteresting content, the internal documents said, the app became even more addictive, adding a new twist to the Journal’s findings.

Notoriously hard

Auditing recommender systems such as TikTok’s “For You” feed, Spotify’s “Your Daily Mix”, or YouTube’s recommendations is notoriously hard. Unlike simpler algorithms, which take a given input to compute an output, recommender systems typically rely on past user activity, as well as reactions from other users. In other words, because past actions are taken into account, it is impossible to run the exact same experiment twice (for instance by running a control test and another test where a variable is changed).
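To make the repeatability problem concrete, here is a minimal sketch in Python. It is a toy, not a description of how TikTok, Spotify or YouTube work: the `ToyRecommender` class, its watch-time ranking rule and the topic names are all hypothetical, chosen only to show that a system which remembers past activity returns a different feed the second time the same bot runs the same session.

```python
from collections import Counter


class ToyRecommender:
    """Hypothetical recommender that ranks topics by an account's past watch time."""

    def __init__(self, topics):
        self.topics = list(topics)
        self.watch_time = Counter()  # per-account history, i.e. hidden state

    def recommend(self, k=3):
        # More accumulated watch time ranks first; ties keep catalogue order
        # because Python's sorted() is stable.
        ranked = sorted(self.topics, key=lambda t: -self.watch_time[t])
        return ranked[:k]

    def record_watch(self, topic, seconds):
        self.watch_time[topic] += seconds


def run_session(recommender, favourite, k=3):
    """Simulate one bot session: view the feed and linger on one topic."""
    feed = recommender.recommend(k)
    for topic in feed:
        # Assumed behaviour: 30 seconds on the favourite topic, 2 on the rest.
        recommender.record_watch(topic, 30 if topic == favourite else 2)
    return feed


if __name__ == "__main__":
    rec = ToyRecommender(["sports", "cooking", "music", "travel"])
    print(run_session(rec, favourite="music"))  # first run
    print(run_session(rec, favourite="music"))  # same run, different feed
```

In this toy, the only way to repeat the first run exactly is to start again with a fresh account. Even that would not be enough on a real platform, where the reactions of other users also change over time.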

Despite this hurdle, academics, journalists, activists, and users do investigate. Some, like the Journal’s experiment described above, build a series of bots. The amount of data collected in such cases can be large enough to isolate a specific aspect of the recommender system. The problem, however, is that bots are not humans. Their experience of a recommender system might be so different from ours as to be meaningless.

Bots face a very technical problem, too. It is orders of magnitude easier to build bots running in a computer’s browser than to have them use a mobile app. Recommender systems might work differently on a laptop and on a phone, and most users by far use mobile apps. It is not an insurmountable problem, Martin Degeling told me. Degeling is investigating TikTok at Stiftung Neue Verantwortung and, while he currently runs bots in-browser, he said that tools such as Frida could be used to bypass some of the obstacles that prevent automating activity on mobile phones.

Another avenue of work is to use data from actual users. Since the General Data Protection Regulation (GDPR) entered into force in 2018, most online services let users download their personal data. Clément Henin, who recently obtained a PhD on the explainability of algorithms, told AlgorithmWatch that researchers should use anonymized, user-provided data in their auditing efforts, if they manage to assemble a representative sample of users. Such an approach is much more ethical than building browser plug-ins which, Henin added, heighten the selection bias. Only users who are keen on auditing will agree to install such plug-ins, he said.


Given these obstacles, auditors must remain modest about their findings. Carolina Are, an Innovation Fellow at Northumbria University's Centre for Digital Citizens, told AlgorithmWatch that “a lot of auditing is conjecture”. “No third-party audit can be successful without transparency and communications from platforms,” she said.

While online services might not know how their recommender systems work (a Facebook employee once said the company did not know how a tweak to its newsfeed would pan out), only they have information about who users are – a prerequisite for building a representative sample – and only they have unlimited access to their software.

As a result, these companies often carry out the best audits. In 2021, for instance, Twitter published the result of a randomized experiment on several million users, which showed that the timeline algorithm did favor content from right-wing parties in the United States, Canada, Japan, France, Spain and the United Kingdom (only Germany bucked the trend). But even if other platforms were to carry out and publish such internal research, few have the credibility required for the results to be taken seriously.
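The logic of such a randomized experiment can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Twitter's actual method: the article does not describe the study design, so the control and treatment split, the `assign_groups` and `amplification` helpers, the 20 percent control share and the impression counts in the usage note are all hypothetical.

```python
import random


def assign_groups(user_ids, control_share=0.2, seed=0):
    """Randomly split users into a control and a treatment group.

    Assumption: control users would see an unranked timeline and treatment
    users the ranked one, so that only the ranking differs between groups.
    """
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    control, treatment = [], []
    for uid in user_ids:
        group = control if rng.random() < control_share else treatment
        group.append(uid)
    return control, treatment


def amplification(impressions_treatment, n_treatment,
                  impressions_control, n_control):
    """Ratio of average impressions per user, treatment over control.

    A value above 1.0 means the content reached treatment users more often
    than control users; a value of 1.0 means no difference.
    """
    per_user_treatment = impressions_treatment / n_treatment
    per_user_control = impressions_control / n_control
    return per_user_treatment / per_user_control
```

With made-up numbers, `amplification(300, 100, 100, 100)` returns `3.0`: the content was seen three times as often per user in the treatment group. Because users are assigned at random, such a gap can be attributed to the ranking rather than to differences between the two groups. Only the platform can run the experiment this way, since only it controls which timeline each user sees.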

Jonathan Stray, a Senior Scientist at the Berkeley Center for Human-Compatible AI working on recommender systems, told AlgorithmWatch that “the most reliable method would be [third-party] audits in collaboration with the platform operator. Several laws require this, either passed laws (DSA article 37) or proposed (PATA [in the United States]).” But “even with full access, it's not yet clear exactly how an audit would work or what it would report,” he added.

Involving users

Besides methodological issues, experts might not audit for the right things. Regular users can also run audits, often by posting a piece of content several times under different conditions and observing the platform’s reaction. Such user-led audits, while less technical, can be powerful, too. In September 2020, a Twitter user published a picture of Barack Obama and Mitch McConnell, showing that the service’s image cropping algorithm cropped out the darker-skinned of the two men. After it was widely shared, this mini-audit led Twitter to deactivate the algorithm in question, which was later found to indeed favor specific skin tones (as well as being ableist and ageist).

“User-based audits can lead to new insights on algorithmic harms that potentially even well-intentioned first-party audits would not have thought to test for,” Agathe Balayn, a PhD candidate in computer science at Delft University of Technology who works on understanding the harms of machine learning systems, told AlgorithmWatch.

Knowing what to audit for can be as thorny an issue as the technical hurdles described above. Recommender systems, once the larger environment they operate in is taken into account, are so complex that “how they work” is, in itself, an unanswerable question. “The types of explanations about the system should be different for different stakeholders having different objectives (Machine Learning practitioners might want explanations to debug their system, while data subjects might want to obtain explanations to ask for recourse, for instance), but it still is an unresolved topic to really characterize the explanations needed in each case,” Balayn said.

Almost all the experts AlgorithmWatch talked to agreed that auditing recommender systems required collaboration across disciplines. Asking users directly can yield insights as to what might be wrong about a particular service. This ethnographic work is often necessary to be able to design more technical experiments – but not always, as some harms (such as discrimination) might not be visible to users.

“A field such as ‘algorithmic studies’ is slowly emerging,” Balayn said. “I can see both technical, very algorithmic, works (developing methods for third-party audits), and more socio-technical investigations (studying how current audits are performed by practitioners or users and what are the challenges) being published in the last few years”.

Auditing for change

Despite these challenges, audits can have profound effects. Carolina Are’s autoethnography of Instagram’s shadowbanning policy (restricting the visibility of a user’s posts without informing them), for instance, proved useful in understanding how this recommender system worked. The results, in turn, informed her activism on online censorship.

In another case of a user-led audit leading to concrete change, British model Nyome Nicholas-Williams, who is Black and plus-size, revealed that Instagram censored pictures of her. After intense campaigning, the social network changed some of its policies.

But the power of audits should not be overstated. “Until the system and structure of platform governance is in place, audits help investigate and raise awareness about issues, but platforms only make cosmetic changes as a result”, Are said.

Edited on 3 November: In the latest version of the DSA, audits are at Art. 37, not 28.
