Tools to predict risk in conflict situations are the order of the day. They have existed and been in use for decades, largely predating the hype around Artificial Intelligence. Many of them do not use Machine Learning. Instead, they are based on a psychological questionnaire designed and evaluated by people, each with their own biases and prejudices.
Gender-based violence is one area where much emphasis is placed on prediction tools, although not all systems are known widely. While the research community is familiar with Spain’s VioGén algorithm and the Ontario Domestic Assault Risk Assessment (ODARA) questionnaire in Canada, an algorithm from the Basque Country, an autonomous region in the north of Spain, has slipped below the radar.
The tool is called EPV, which stands for “Escala de predicción del riesgo de violencia grave contra la pareja” in Spanish – or Intimate Partner Femicide and Severe Violence Assessment. It was created in 2005 and has been in use since January 2007, so in fact it existed prior to VioGén. A “revised” version created in 2010, called EPV-R, is currently used by the Ertzaintza, the police force operating in the Basque Country.
VioGén and EPV-R are not the same tool. They use different sets of questions and different statistical analyses, and Basque police officers must make specific judgement calls when applying the questionnaire. In particular, they decide whether a person is very or not very jealous or whether they come from a country with "a different culture to the western one".
An algorithm with judicial clout
The Basque Country’s equivalent of VioGén is a computer platform, which the police use to sort cases related to gender and domestic violence, called EBA (Emakumeen eta Etxekoen Babesa in Euskera, the language of the Basque Country – meaning Protection of Women and Households). Through this program, the police assign risk probabilities and police protection to the victims.
This process runs parallel to the judicial investigation. The Ertzaintza has its own protocol for assigning protection measures to victims, even if a judge has not yet ruled on the case, and the police apply those measures according to the risk assessment assigned to each person. The assessment can change if the police manage to gather more information from the victim, the offender or the environment, but EPV-R is always part of it.
This is true to the point that the results of the questionnaire are included in all cases as evidence in the report received by the judges, something that does not happen in other similar systems. Judges then face the dichotomy between their own opinion about the level of risk and that which is assigned by the police – and the algorithm.
Nationality matters (but some matter more than others)
EPV-R is a questionnaire made up of 20 questions which provide a score up to a maximum of 48 points, Oskar Fernandez, chief inspector in the Ertzaintza and EBA platform developer, explains to AlgorithmWatch. Some of the questions are worth only one point, while others, with a larger predictive value, are worth up to three points. Officers must ask a minimum of 12 questions, of which six must be those of larger predictive value. A score of more than 24 points means the risk is severe, more than 18 is high, more than 10 is medium and, if the score falls beneath 9, it is supposedly low.
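The scoring bands Fernandez describes can be sketched in a few lines of Python. This is only an illustration of the thresholds quoted above, not the Ertzaintza's software: the function names are hypothetical, and because the description leaves scores of 9 and 10 unassigned, the sketch assumes they count as low.

```python
# Figures taken from the description above; everything else is illustrative.
MAX_SCORE = 48                 # maximum total score
MIN_ANSWERED = 12              # minimum number of questions asked
MIN_HIGH_VALUE_ANSWERED = 6    # of which, questions of larger predictive value


def risk_band(score: int) -> str:
    """Map a total EPV-R score to the risk band described in the article.

    Assumption: scores of 9 and 10 are not covered by the description
    ("more than 10" is medium, "beneath 9" is low); they are treated as low.
    """
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be between 0 and {MAX_SCORE}")
    if score > 24:
        return "severe"
    if score > 18:
        return "high"
    if score > 10:
        return "medium"
    return "low"


def is_valid_assessment(answered: int, high_value_answered: int) -> bool:
    """Check the minimum-questions rule: 12 asked, 6 of them high-value."""
    return (answered >= MIN_ANSWERED
            and high_value_answered >= MIN_HIGH_VALUE_ANSWERED)


if __name__ == "__main__":
    for total in (5, 11, 19, 25):
        print(total, risk_band(total))
```

Note that a one-point difference at a threshold, for example 24 versus 25, moves a case from high to severe, which is why the subjective grading of individual questions matters.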
Fernandez explains that the questionnaire is not necessarily asked in numerical order to victims. The first question on the list is only worth one point, but it makes reference to whether the victims themselves or the offenders are “foreigners” — except the word used doesn’t refer to all foreigners: “This question refers to people who have a culture other than the Western one,” he clarifies, and gives the example of how a person from France or the French Basque Country would not count as a foreigner.
“It would only apply in all the cases where there exists a different cultural understanding than the European when it comes to a couple,” Fernandez states. It appears that officers follow neither a specific methodology nor a list of countries when applying this question, so the choice is subjective.
Data collected by the National Statistics Institute regarding sexual offences shows that, in general, one in five of these crimes are perpetrated by non-nationals. Nevertheless, nationality keeps coming up as a necessary and imperative factor in risk assessment algorithms, as seen with other examples in Spain like RisCanvi, a computer tool used in Catalonia to predict whether a convict will reoffend after their prison sentence. Nationality is also one of the points of analysis used by evaluators working with this algorithm.
Most of the questions with a “larger predictive value” refer to violence: whether the offender has a history of being violent to someone other than their partner, whether death threats have been made, whether weapons are involved or whether the offender is intensely jealous.
Independent researchers who have looked into the operation of the algorithm consider some of these parameters to be very complicated to evaluate, for instance the ones referring to the offender being jealous. “From a more philosophical perspective, the problem lies in quantifying measures that are too subjective”, says Ana Valdivia, researcher at the Oxford Internet Institute and member of AlgoRace. To that, Fernandez says that sometimes jealousy is so “obvious” and “apparent” that “it is not necessary to be a psychologist” to detect it. Thus, the grading between 0 and 3 points depends on the interpretation of each agent.
Fernandez also explained that, in addition to the joint risk assessment, they use the software to detect signs of vicarious violence and suicide attempts by the perpetrator, which they then communicate to judges. He emphasises that, in his view, people's participation is essential for the tool to work properly. “Clearly the introduction of the tool was a success, taking into account that since 2011, no victims protected by the Ertzaintza have been murdered in the Basque Country,” he adds.
The numbers of murdered victims of gender violence over the past two decades are similar before and after 2011: between 2002 and 2020, 47 women were murdered, nearly half of them (20) since 2011. The statistical data collected by Emakunde, the Basque Institute for Women, does not specify whether these women were under police protection or not.
A downward trend
Fernandez told AlgorithmWatch that they can increase the risk assessment provided by EPV-R, but never reduce it: “We normally increase the grading on the risk offered by EPV-R.” Asked for the number of times they had resorted to this action since the tool has been in operation, he said that the Ertzaintza does not use historical data in their daily procedures, so it is not actively registered.
This trend, however, is in line with the results of a preliminary study carried out in 2022 on the performance of the algorithm. The report shows that when assessing cases that had been catalogued as high risk, half of the time (53%) the algorithm concluded that the risk was low.
“The number of false negatives is higher than true positives when the cutoff score is 10. This means that the assessment tool is more likely to classify severe cases as non-severe at this score, which could imply the underestimation of cases,” sums up the report. The cutoff score refers to the point where the creators of the algorithm would consider that the risk changes from severe to not severe.
“The problem with this algorithm is in the balance between true positive cases when the risk is high and the algorithm classifies it as such, and true negative cases when the algorithm says the risk is low and it does not fail,” says Valdivia, who is also one of the authors of the study. She explains why this is an issue: “Because in the worst case of non-severe cases, what will happen is that you will classify non-severe as severe and you will have more resources and protection. On the other hand, if you classify a severe case as non-severe, there is a problem”.
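The trade-off Valdivia describes can be made concrete by counting outcomes at a given cutoff. The sketch below is a minimal illustration: the scores and labels are invented toy data, not figures from the 2022 study, the function name is hypothetical, and it assumes that a score at or above the cutoff is flagged as severe.

```python
def confusion_counts(scores, is_severe, cutoff):
    """Count outcomes when a score >= cutoff is flagged as severe.

    Assumption: the comparison is ">= cutoff"; the study may define it
    differently.
    """
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for score, severe in zip(scores, is_severe):
        flagged = score >= cutoff
        if severe and flagged:
            counts["tp"] += 1   # severe case correctly flagged
        elif severe:
            counts["fn"] += 1   # severe case missed (underestimated)
        elif flagged:
            counts["fp"] += 1   # non-severe case given extra protection
        else:
            counts["tn"] += 1   # non-severe case correctly left unflagged
    return counts


# Invented toy data for illustration only.
scores = [4, 7, 9, 12, 15, 6, 3, 11, 20, 8]
is_severe = [True, True, True, True, False, False, False, False, True, True]

counts = confusion_counts(scores, is_severe, cutoff=10)
print(counts)  # {'tp': 2, 'fn': 4, 'fp': 2, 'tn': 2}
```

In this toy set, false negatives outnumber true positives at a cutoff of 10, which is the pattern the report warns about: a false positive costs extra resources, while a false negative leaves a severe case underprotected.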
Julián García worked as an examining judge for 14 years in the Basque Country and contributed to the preparation of the study. He recalls that in every case regarding gender violence, he would receive, along with the report containing the complaint, medical data and police documentation, a document with a risk assessment grade produced by a tool called EPV-R, but with little context on how it obtained the result.
“The questions in the tool place a high value on the use of weapons or signs of physical violence, therefore the result was not as reliable in cases where this circumstance was not documented”, he says, recalling a case he examined in which a woman with a record of protective orders was being stalked by her partner. The man would watch her from a bench in front of their house and ring her bell at night, waking her and her daughter, who had to be treated for mental health problems. García granted a protection order and considered the risk was high, but EPV-R classified this particular case as low.
The effect of the algorithm’s results in court is not easy to analyse. “When we make a protection order in the duty court, we only have the statement, and normally the victim is summoned and the alleged aggressor is detained. It may be that the victim testifies and you can clearly see the conditions, the injuries, that the violence is not a one-off event but has been projected over the last few months… But there are other cases in which it is not so clear and the examining magistrates have to make a decision within an hour”, García states.
In such cases a judge might be inclined to check the EPV-R results, but academic literature shows that sometimes their own prejudice and bias outweigh the algorithm. Some studies claim that in most cases judges ignore the algorithmic recommendations, especially when these do not match their understanding of social behaviour, for example when dealing with a racialized population while harbouring racist tendencies. Other experiments show that test participants' recidivism predictions are similar with and without advice from the algorithm. In a minority of cases participants react to the advice by changing their prediction, normally when the system predicts that the person will not reoffend.
Ujue Agudo is a psychologist at Bikolabs who researches the interaction between people and automated decision-making systems and has carried out several experiments to understand how much people are influenced by what an algorithm says. When it comes to risk assessment, she finds it difficult to quantify this influence: “When the algorithm gets it right, people agree, but when it gets it wrong, they don't blindly listen, but it makes them doubt, so they end up changing their judgement,” she explains about a recent experiment in which she tested how people would react to a system like RisCanvi.
“If there are three different options – low, medium and high – people will tend to stay in the middle” for fear of being responsible for “convicting” someone, she adds. From a psychological perspective, this happens due to a combination of two phenomena: automation bias and delegation of responsibility. The problem we are still facing? That all of these tests are carried out with vulnerable populations like people in jail or gender violence victims, Agudo says.