The ethical abyss of big data research

When learning was still a predominantly human undertaking.

“If a dataset can be downloaded from the web, regardless of whether it originated from a breach or other illegal activity, it is considered to be in the public domain and falls under the IRB exemption for public domain datasets.”

This is only one of many staggering assessments Kalev Leetaru received when he asked data scientists, researchers, universities, research institutions, and research funders to explain how they ensured that their use of big data sets was ethically tenable. Leetaru, Senior Fellow at the George Washington University Center for Cyber & Homeland Security, was confronted with a plethora of answers, almost all of them explaining why the respondents were unable to explain their reasoning. All this at a time when a number of high-profile studies had to be retracted after publication in reaction to a firestorm of criticism by other researchers (who have a more accurate ethical compass, apparently).

Looking at Leetaru's article, we find his observations on machine learning particularly illuminating:

Even research which is conducted within the university setting is increasingly pushing up against new ethical frontiers in the creation of machine learning algorithms based on vast pools of human-created training data. For example, several researchers I spoke with mentioned situations where colleagues had taken large datasets licensed to the university for strictly non-commercial use or collected from human subjects for strictly academic research and used them to construct large machine learning computer models. These models were then licensed from the university to the faculty member’s private startup, where they were then used for commercial gain. In at least some cases, protected human subjects data was used to create a computer model for academic research, which was approved by IRB, but that model was then allegedly subsequently licensed by the university for commercial use to the faculty member’s startup. None of the researchers were privy to whether IRB had approved the commercial licensing or if that occurred without IRB knowledge and they argued that the very nature of a machine learning model deidentifies such data to the point that it should no longer be considered human subjects data. Even in cases where existing “public” licensed datasets were used for IRB exempt projects, this creates a highly novel ethical and legal landscape as universities leverage their unique academic status to acquire large datasets under free or vastly reduced licensing schemes and then transform that data into commercial products. At what point is human subjects data sufficiently transformed to the point that it no longer is subject to IRB approval? Should IRBs review all commercial licensing of algorithms, datasets and software from universities for ethical oversight?