Face recognition data set of trans people still available online years after it was supposedly taken down

A US academic scraped videos off YouTube to train face recognition software on pictures of trans people. New research reveals that his methods were even more careless than previously known.
Image by EFF Photos on flickr

In 2013, Karl Ricanek, a professor of computer science at the University of North Carolina at Wilmington, announced a new data set of pictures to be used to train face recognition software. The data set contained about 10,000 pictures of 38 people, which were extracted from videos he scraped off YouTube. The videos had been uploaded by trans people who documented their hormone replacement therapy (HRT). These transition timeline videos are frequently shared within trans communities as informational material.

Mr Ricanek saw the material as valuable to increase the precision of face recognition algorithms that, at the time, might not have been able to correctly match the face of an individual before and after HRT. Besides apparently giving little thought to the fact that such a data set could be used to target and harm trans people, he did not consider that the people who created the videos did not consent to having their pictures used in such ways.

In 2017, as word of the data set spread, public criticism mounted. In an investigation by The Verge, Mr Ricanek said that he realized “people could use [the data] for harm, and that was not [his] intent”. He explained that he had tried to obtain consent from the people who posted the videos he scraped but could not reach out to every one of them. He also said that he did not share the actual pictures with anyone, but only the links to the videos, and that he would stop doing that.

Messier than thought

Now, a new investigation by Os Keyes, a doctoral student at the University of Washington, and Jeanie Austin, a library scholar, reveals that the data set’s creation and distribution was much messier than previously thought.

In a peer-reviewed article published in Big Data & Society, they write that the data set, which was supposedly shelved five years ago, was still available online in April 2021 via a Dropbox URL, without password protection. Furthermore, the data set was not a list of YouTube URLs, as Mr Ricanek claimed, but contained the videos themselves, including videos that had since been made private or deleted.

The authors asked the University of North Carolina at Wilmington’s institutional review board, an ethics committee which is mandatory at US universities, for information on how the project was approved. They found that Mr Ricanek never sought permission, although it is required for research where “a subject [can] be individually identified by any data”.

Email exchanges between Mr Ricanek and his team, which the authors obtained through a freedom of information request, reveal that they probably did not seek consent from all the trans people who uploaded the videos. The material which ended up in the data set was likely copyrighted, as none of the videos had been published under a license that allowed reuse.

Finally, the authors looked at the academics who were granted access to the data set. Many passed it on to doctoral students and other researchers without any oversight. Only one out of 16 reusers felt uneasy about the data and did not work with it.

No exception

By the authors' own admission, the investigation was conducted to show how “messy” it is to audit an automated system, and how emotions play a large role in the process. (Both authors, who are trans, explain that they felt suspicion and anger while investigating and that any audit should account for the feelings of the auditors.)

Their detailed account of a data set that is relatively small by the standards of current automated image recognition systems is also illustrative of a wider problem. Other data sets that have been collated by academics have similar shortcomings. ImageNet, which boasts 14 million images and is widely used in research, contains pictures of naked children, drunkenness and violence, and the scholars who put it together did not seek the consent of the people represented or their legal guardians.

Although Mr Ricanek made clear that he would not proceed as he did if he were given another chance, little has been done in the past ten years to make academics more accountable. Mx Keyes, who is the lead author of the paper, told AlgorithmWatch that many computer scientists still regard any data that has been made public as fair game for any use. “It's intensely disappointing for me to see it still so widespread as an attitude,” they said. While they acknowledge that some within the Machine Learning community might refrain from denouncing these practices, they are relatively upbeat that it can be done from the outside. “It's kind of unlikely that facial recognition researchers potentially bearing grudges is a thing I'll lose sleep over,” they added.

Regulators in Europe are currently debating how to keep automated systems in check. The AI Act, which was proposed by the European Commission but has yet to become law, states in its Article 10-3 that “training, validation and testing data sets shall be relevant, representative, free of errors and complete”. Considering that an average computer vision system relies on millions or billions of pictures, effectively implementing such a law will require a policing infrastructure which does not exist yet.

Edited the second paragraph on 20 September to emphasize that trans people have historically been discriminated against.

Nicolas Kayser-Bril

Reporter

Photo: Julia Bornkessel, CC BY 4.0

Nicolas is a data journalist. He pioneered new forms of journalism, regularly speaks at international conferences, teaches journalism at French journalism schools, and gives training sessions in newsrooms. As a self-educated developer, he created interactive, data-driven applications for Le Monde. He built the data journalism team at OWNI and co-founded and managed Journalism++ from 2011 to 2017. Nicolas was one of the main authors of the Datajournalism Handbook.