New study: Researching chatbots ahead of 2024 state elections
Large Language Models Continue To Be Unreliable Concerning Elections
Through our research on Large Language Models and election information, we were able to substantially improve the reliability of the Microsoft Copilot chatbot's safeguards against election misinformation in German. However, barriers to data access greatly restricted our investigations into other chatbots.
This report is the third in our series on Large Language Models (LLMs) and elections, following work in September-October 2023 and August-September 2024. Our aim was to investigate chatbot answers at a larger scale than is usual for this kind of research, to test the feasibility of identifying systematic patterns that could inform recommendations.
The report in brief
- We investigated four models - Microsoft Copilot, Google Gemini, and OpenAI's GPT-3.5 and GPT-4o. Our cases were three regional elections, in the states of Thuringia and Saxony (1 September 2024) and Brandenburg (22 September 2024).
- We reported findings related to the 1 September elections to companies, and used the 22 September elections to see if outputs changed in response. We saw changes from Microsoft, but not from other companies.
- We developed 817 prompts across 14 topics - from voting dates to candidate positions - covering all three states, the major parties, and their candidates.
- We collected a total of 598,717 responses between 29 July and 30 September. For more details on how we developed prompts and collected data, consult our methodology appendix (an illustrative sketch of scripted collection follows this list). Our data collection was conducted with the technical assistance of CASM Technology.
- In line with previous research, we found that models still produce inaccurate answers, including false scandals and incorrect election dates. We also found that issues often seem to be repeated, rather than being “one-off” mistakes.
- While we present some potential patterns and feasible methods for further investigations, we largely found that these investigations are challenging due to (i) often limited access to realistic data and (ii) the diversity of issues identified across different models and topics, meaning no “one size fits all” research method worked at scale.
- All models were able to reliably recognise and rebut calls for violence against politicians.
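For illustration, the sketch below shows what scripted collection against a fixed prompt set can look like, using Python and OpenAI's API. This is not our actual pipeline - that was built with the technical assistance of CASM Technology and also covered chatbot interfaces - and the file names and column layout are hypothetical.

```python
# Minimal illustrative sketch of scripted response collection - not the study's pipeline.
# Assumes the official OpenAI Python SDK (v1+) and an OPENAI_API_KEY in the environment.
import csv
import datetime

from openai import OpenAI

client = OpenAI()

# Hypothetical prompt file: one prompt per row, with a topic label (e.g. "voting dates").
with open("prompts_de.csv", newline="", encoding="utf-8") as f:
    prompts = list(csv.DictReader(f))  # expects columns: "topic", "prompt"

run_started = datetime.datetime.now(datetime.timezone.utc).isoformat()

with open("responses.csv", "a", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for model in ("gpt-3.5-turbo", "gpt-4o"):
        for row in prompts:
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": row["prompt"]}],
            )
            writer.writerow([
                run_started,
                model,
                row["topic"],
                row["prompt"],
                reply.choices[0].message.content,
            ])
```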
What we found
We were able to investigate Microsoft Copilot in more detail than the other models. This is because (i) we were able to automate data collection directly from the chatbot, and (ii) we were provided with some basic usage data via a data access request to Microsoft.
- Election safeguards on the Copilot model still work unevenly, although they have considerably improved since our first findings in August – from ~35% of questions being blocked then, to ~80% now.
- The safeguards are more effective for some topics than others. Questions about election processes were frequently blocked (~80% of the time). However, prompts about parties or candidates were blocked less often, sometimes only 2% of the time.
- When Copilot answered questions, manual analysis suggests that only around 5-10% of its answers had clear factual inaccuracies. It also frequently supplied links (in 93% of answers), largely from reliable sources.
- From usage data supplied by Microsoft, we saw clear spikes in likely election-related queries around each election day (a few thousand queries over the surrounding four-day period). With the available data, it is not clear whether these came from people seeking information to inform their vote or from interest in the results of the elections.
- The process of data access was uncertain and challenging for both us and Microsoft. In particular, we had limited opportunities to re-request data based on initial findings from the first data packet - we mostly had "one shot" to get relevant data, rather than using an iterative approach of testing and refining as would be usual in this sort of research. In our response to the EU consultation on data access, we make specific proposals for an effective process for data access.
Analysis of Google Gemini and of OpenAI's GPT-3.5 and GPT-4o was more challenging. Due to technical limitations imposed by the companies, we could not automate collection from the chatbots themselves, but could only access the models via Application Programming Interfaces (APIs). APIs are a more technical way than chatbots of asking questions of models, and also differ from chatbots in features such as parameters or metaprompts, which can affect outputs (see the illustrative sketch after this list). As such, data collected via API may only give limited insight into how “normal” users experience the chatbots. Nonetheless, based on the API data, we found:
- Google Gemini has a high rate of inaccuracy: from 45% of answers inaccurate prior to the Thuringia and Saxony elections, to 60% prior to the Brandenburg election.
- In practice, the Gemini chatbot, when accessed via browser, is extremely effective at blocking election-related queries, so normal users should not be affected by these inaccuracies. However, this raises the question of why the same safeguards are not also applied at the API level, given the very high inaccuracy rate, and whether similar problems may appear as Gemini is increasingly integrated into other technologies, e.g. search.
- The GPT-3.5 and GPT-4o models showed consistent levels of inaccuracy (~30% for GPT-3.5 and ~14% for GPT-4o) across both election periods. A consistent problem was outdated information, including in the latest GPT-4o model. This was not always clearly flagged.
- Another consistent problem across models was overconfident extrapolation from limited information, particularly for the new party Bündnis Sahra Wagenknecht (BSW) and for some candidates. In these cases we saw plausible-sounding but inaccurate answers which, for example, invented names for the BSW party or conflated information about different people who share a candidate's name.
- Gemini and both GPT models rarely supplied links to source material (in less than 6% of answers) unless explicitly prompted to.
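To make the API-versus-chatbot difference concrete, the minimal sketch below shows an API request in which the researcher, rather than the product, chooses the metaprompt (system message) and sampling parameters; the values shown are arbitrary examples, not those used in the study. Consumer chatbots wrap user questions in their own, undisclosed settings, which is one reason API outputs can diverge from what ordinary chatbot users see.

```python
# Illustrative sketch: an API call exposes parameters that chatbot users cannot set.
# Assumes the official OpenAI Python SDK (v1+); system message and temperature are arbitrary.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "Wann findet die Landtagswahl in Brandenburg statt?"  # example election query

reply = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.7,  # sampling parameter chosen by the researcher, not the product
    messages=[
        # Placeholder metaprompt; production chatbots use their own, non-public ones.
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": question},
    ],
)
print(reply.choices[0].message.content)
```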
Our recommendations
- More access needed: Public interest data access and researcher access programmes for LLMs must be extended. This cannot be limited to API access, which does not allow for fully realistic investigations. For the one company that worked with us - Microsoft - we were able to help improve their safeguards. However, in general this work was challenging and resource-intensive, due to the diversity of potential issues raised by the LLMs, which calls for a similarly diverse range of approaches. The research and NGO community must be enabled, not burdened, in doing this work.
- Guardrails for candidates, not just parties: Election safeguards must include a focus on candidates. We found that Copilot’s safeguards were less effective at blocking questions about candidates, and that other models either (i) do not give information on candidates or (ii) extrapolate from limited information into invented stories, even scandals. Safeguards around candidates are particularly important given the implications for these individuals.
- Companies should take responsibility for limitations: Generic caveats to “check information” are not helpful safeguards for important topics like elections, given that many users may be turning to chatbots for easy answers. Models should guide users to reliable information from human authors, rather than trying to summarise information themselves (with the risk of misstatement). Models should also be clear about how up to date their information is, and further steps should be taken to avoid extrapolating from limited information, e.g. about less well-known candidates or parties.
Read more on our policy & advocacy work on ADM in the public sphere.