Automated translation is hopelessly sexist, but don’t blame the algorithm or the training data

Automated translation services tend to erase women or reduce them to stereotypes. Simply tweaking the training data or the models is not enough to make translations fair.

Nicolas Kayser-Bril

Ever since Google Translate launched in the late 2000s, users have noticed that it gets gender wrong. In the early 2010s, some Twitter users expressed outrage that the phrase “men should clean the kitchen” was translated into German as “Frauen sollten die Küche sauber”, which means “women should cleanly (sic) the kitchen”.

Ten years later, automated translation has improved dramatically. “Men should clean the kitchen” is now correctly translated into all 107 languages Google Translate offers. But many issues remain.

Google consistently translates the French phrase “une historienne écrit un livre” (a female historian writes a book) to the masculine form in gender-inflected languages. The mistake arises from Google’s reliance on English as a pivot, as AlgorithmWatch showed previously. When translating between gender-inflected languages, Google first translates to English, which has few markers of gender (e.g. “a historian” could be a person of any gender). The English version is then translated to the target language. At this step, Google Translate guesses gender based on the data it was fed during training.
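The pivot mechanism can be illustrated with a toy sketch. The mini-dictionaries below are invented for the example (real systems use neural models, not lookup tables), but they show where the information is lost: once the feminine marker disappears in English, the second hop has to guess.

```python
# Toy illustration of pivot translation losing gender information.
# These mini-dictionaries are invented for the example; they are not
# how Google Translate actually works internally.

FR_TO_EN = {"une historienne": "a historian"}   # feminine marker lost here
EN_TO_ES = {"a historian": "un historiador"}    # masculine "guessed" here

def pivot_translate(phrase_fr):
    """Translate French to Spanish via an English pivot."""
    english = FR_TO_EN[phrase_fr]   # step 1: gender marker disappears
    return EN_TO_ES[english]        # step 2: target gender must be guessed

print(pivot_translate("une historienne"))  # 'un historiador' - feminine erased
```

However the guess is made at the second step, the feminine form can no longer be recovered from the English intermediate alone.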

Such errors are not inherent to machine translation. Some services, such as Bing Translator or the European Commission’s eTranslation, accept the existence of female historians.

There is more. In one of eTranslation’s specific domains, “IP case law”, pronouns that are gender-neutral in one language are not assigned a gender in the target language. The phrase “hän hoitaa lapsia” in Finnish is translated to “he/she takes care of children”. Other services assign a gender, usually feminine, to the subject of that sentence.

Just the training data

Markus Foti heads the 20-person-strong team behind eTranslation. When I asked him how they managed to provide more accurate translations than others, at least when it comes to gender, he was quick to point out that they did not, in fact, do much engineering. “The output is a result of what the model learns from the data used to train it,” he told me.

The European Commission built several data sets from the ground up. The use of “he/she” to translate the Finnish “hän” is not a conscious decision by the eTranslation staff. Rather, it comes down to the choices made by the translators who specialize in IP case law and who translated the many rulings that were later incorporated into a training data set.

Mr Foti explained that it would not be practical to force such gender alternatives in all models. Languages that encode gender in more complex ways than English (e.g. in word endings) would be a challenge, not to mention that outputs would be hard to read.

Inside ParaCrawl

For Mr Foti, training data remains the main factor for the sexist outputs of automated translation services. One such data set is ParaCrawl, which is maintained by several European universities and used, among others, by eTranslation.

Anyone can download these training data sets from the website. I chose the one with French-English pairs. With over 100 million phrases and 2 billion words, it is the largest on offer. I used grep, a command-line tool, to explore the 26-gigabyte file.

The data set contains a million phrases containing the word “homme” (man) and 900,000 phrases containing “femme” (woman). The difference is just a tenth of a percent of the total number of phrases. But the gap is not evenly distributed.
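The grep-style count is easy to reproduce. Here is a minimal Python sketch run on a tiny invented sample rather than the real 26-gigabyte file; it counts lines containing a whole word, much like `grep -c -w` would.

```python
import re

# A tiny invented sample in ParaCrawl's tab-separated source/target format,
# standing in for the real 26 GB French-English file.
sample_corpus = [
    "un homme marche\ta man walks",
    "une femme lit\ta woman reads",
    "l'homme parle\tthe man speaks",
]

def count_word(lines, word):
    """Count lines containing `word` as a whole word (like `grep -c -w`)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return sum(1 for line in lines if pattern.search(line))

print(count_word(sample_corpus, "homme"))  # 2
print(count_word(sample_corpus, "femme"))  # 1
```

The word-boundary pattern matters: it also catches elided forms such as “l'homme” while excluding unrelated words that merely contain the letters.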


Female members of Parliament and female ministers are mentioned five times less often than their male peers (“député·e” and “ministre” have other meanings but those are rarely used). This imbalance does not reflect reality. Currently in France, two in five members of Parliament and half of government ministers are women.

Automated translation engines amplify the biases of their training data sets. A 2017 paper looked at a data set where “cooking” was associated 33% more frequently with women. After training, the translation engine produced results where “cooking” was associated 68% more frequently with women.
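One common way to quantify such an association, in the spirit of that paper, is the fraction of a word’s gendered co-occurrences that involve women: 0.5 is balanced, and anything above skews female. The counts below are invented for illustration, chosen to match the 33% and 68% figures.

```python
def gender_bias(count_with_women, count_with_men):
    """Fraction of a word's gendered co-occurrences involving women.

    0.5 means balanced; above 0.5 skews female. The counts passed in
    below are invented for illustration, not taken from the 2017 study.
    """
    return count_with_women / (count_with_women + count_with_men)

# "cooking" co-occurring 33% more often with women in the training data:
train_bias = gender_bias(133, 100)   # roughly 0.57
# ...and 68% more often in the trained model's output:
model_bias = gender_bias(168, 100)   # roughly 0.63
print(round(train_bias, 2), round(model_bias, 2))
```

The point of the metric is that the model’s score ends up further from 0.5 than the training data’s: the bias is amplified, not merely reproduced.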

Pornography and othering

In French slang ‘beurette’ means a young Arab woman. It has a variety of uses, which include emancipatory self-description. It is also widely used in the pornographic industry to provide overwhelmingly white, male viewers with an exotic fantasy. This leads to the othering of Arab women in French society, as Joseph Downing, a fellow at the London School of Economics, wrote in a 2019 book chapter.

ParaCrawl lists 228 phrases containing the word ‘beurette’. Of these, a full 222 are obviously taken from pornographic websites. By selecting such a lexical field for that word, the makers of ParaCrawl perpetuate – perhaps unconsciously – a vision born out of colonialism. (The French government imposed colonial rule over present-day Morocco, Tunisia, Lebanon and Syria for decades and in Algeria for over a century). The real-world impact of such imbalances is hard to assess as racial bias in automated translation is barely studied.

The University of Edinburgh sent me a statement by people involved in ParaCrawl. They said that independent auditors rated 2,100 sentences from the corpus and did not find any to be offensive. The same audit found that a third of one percent of the training data came from pornographic sources. Phrases are “cleaned” prior to being integrated into ParaCrawl using open-source software.

The world in a data set

The scientists pointed out that a model trained on ParaCrawl performed better than others in a challenge to accurately translate gender. They also said that modifying the data set alone is not the most effective way to address gender imbalances in machine translation. Pointing to a recent paper by a team from the University of Cambridge, they claim that “de-biasing data fails to improve gender bias and also degrades translation quality”.

A common thread in my exchanges with the researchers and in some of the literature is that translating gender correctly can impact the quality of the translation. As if the two were not the same thing. (A spokesperson for the University of Edinburgh disputes this and said that “translating gender correctly is one element of overall translation quality”).

While the problem is less acute in English (translating “the doctor” to “el doctor” is not an error in itself), because of English’s pivotal role in most systems, this disregard for gender accuracy is directly responsible for translation errors between other language pairs (translating “die Ärztin” to “el doctor” changes the meaning of a sentence).

The authors of the Cambridge paper also consider their training data, which is almost always scraped off the internet, to represent the world as it is. They spend no time considering that some segments of the population are over-represented in their corpus, and that the choice of websites they scrape reflects their own views much more than the actual world.

Revealingly, the Cambridge team sourced “a gender-balanced set of sentences” from Wikipedia, but seems to disregard the fact that Wikipedia is written almost exclusively by men who have a long history of harassing women.

Beyond data and models

Eva Vanmassenhove, an assistant professor in Artificial Intelligence at Tilburg University, has been studying gender bias in machine translation since 2018. She told me that when she started, there was almost no research on the topic. Still, today, she often hears people questioning the relevance of her research. “Some people feel attacked” when she presents her work, she said.

Ms Vanmassenhove’s research to make machine translation more accurate and more fair involves incorporating additional context to the training data, for instance by adding gender tags. By making gender explicit in the training data, she showed that translations could be significantly improved. (Google Translate’s 2018 update, which offers masculine and feminine forms for some translations, uses a similar approach.)
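At its simplest, sentence-level tagging amounts to prepending an explicit gender token to each source sentence before training, so the model can condition on it. A minimal sketch, with invented tag tokens (the real preprocessing pipelines and tag vocabularies vary by system):

```python
# Minimal sketch of sentence-level gender tagging, in the spirit of the
# approach described above. The tag tokens and example sentence are
# invented; real systems apply this during corpus preprocessing.

def add_gender_tag(source_sentence, gender):
    """Prepend an explicit gender token to a source sentence."""
    tag = {"female": "<F>", "male": "<M>"}[gender]
    return f"{tag} {source_sentence}"

print(add_gender_tag("a historian writes a book", "female"))
# '<F> a historian writes a book'
```

With such tags present at training time, “a historian” preceded by `<F>` gives the model the signal it needs to produce “une historienne” rather than defaulting to the masculine form.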

Producing better machine translation probably requires input from more than just AI researchers. Experts from computer science, linguistics and gender studies need to work together, Ms Vanmassenhove said.

While some collaboration is already taking place, there is clearly room for more.
