Audit AI Search Tools Now, Before They Skew Research
World view: a personal take on science and society
Generative AI could be a boon for literature search, but only if independent groups scrutinize its biases and limitations.
Michael Gusenbauer is a researcher at the Institute of Innovation Management at Johannes Kepler University in Linz, Austria, and the founder of Search Smart.
Search tools assisted by large language models (LLMs) are changing how researchers find scholarly information. One tool, scite Assistant, uses GPT-3.5 to generate answers from a database of millions of scientific papers. Another, Elicit, uses an LLM to write answers based on its searches for articles in a scholarly database. Consensus finds and synthesizes research claims in papers, whereas SciSpace bills itself as an ‘AI research assistant’ that can explain mathematics or text contained in scientific papers. All of these tools give natural-language answers to natural-language queries.
Search tools tailored to academic databases can use LLMs to offer alternative ways of identifying, ranking, and accessing papers. In addition, researchers can use general artificial intelligence (AI)-assisted search systems, such as Bing, with queries that target only academic databases such as CORE, PubMed, and Crossref.
All search systems affect scientists’ access to knowledge and influence how research is done. All have unique capabilities and limitations. I’m intimately familiar with this from my experience building Search Smart, a tool that allows researchers to compare the capabilities of 93 conventional search tools, including Google Scholar and PubMed. AI-assisted, natural-language search tools will undoubtedly have an impact on research. The question is: how?
The time remaining before LLMs’ mass adoption in academic search must be used to understand the opportunities and limitations. Independent audits of these tools are crucial to ensure the future of knowledge access.
All search tools assisted by LLMs have limitations. LLMs can ‘hallucinate’, making up papers that don’t exist or summarizing content inaccurately by inventing facts. Although dedicated academic LLM-assisted search systems are less likely to hallucinate because they query a fixed scientific database, the extent of their limitations is still unclear. And because AI-assisted search systems, even open-source ones, are ‘black boxes’ — their mechanisms for matching terms, ranking results and answering queries aren’t transparent — methodical analysis is needed to learn whether they miss important results or systematically favour specific types of papers, for example. Anecdotally, I have found that Bing, scite Assistant and SciSpace tend to yield different results when a search is repeated, leading to irreproducibility. The lack of transparency means there are probably many limitations still to be found.
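To make the reproducibility concern concrete, here is a minimal sketch of the kind of check an auditor could run. The search callable is a hypothetical stand-in for whichever AI-assisted tool is being audited, assumed to return a ranked list of paper identifiers such as DOIs; it is not any real tool’s interface.

# Reproducibility check: submit the identical query several times and
# measure how consistently the tool returns the same set of papers.
# `search` is a placeholder for the system under audit.
from typing import Callable, List

def reproducibility(search: Callable[[str], List[str]],
                    query: str, runs: int = 5) -> float:
    """Average pairwise overlap (Jaccard) of result sets across repeated runs."""
    results = [set(search(query)) for _ in range(runs)]
    pairs = [(a, b) for i, a in enumerate(results) for b in results[i + 1:]]
    overlaps = [len(a & b) / max(len(a | b), 1) for a, b in pairs]
    return sum(overlaps) / len(overlaps)

A score near 1 means the tool returns essentially the same papers each time; a low score is the kind of irreproducibility I have observed anecdotally.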
Already, Twitter threads and viral YouTube videos promise that AI-assisted search can speed up systematic reviews or facilitate brainstorming and knowledge summarization. If researchers are not aware of the limitations and biases of such systems, then research outcomes will deteriorate.
Regulations exist for LLMs in general, some within the sphere of the research community. For example, publishers and universities have hammered out policies to prevent LLM-enabled research misconduct such as misattribution, plagiarism or faking peer review. Institutions such as the US Food and Drug Administration rate and approve AIs for specific uses, and the European Commission is proposing its own legal framework on AI. But more-focused policies are needed specifically for LLM-assisted search.
In working on Search Smart, I developed a way to assess the functionalities of databases and their search systems systematically and transparently. I often found capabilities or limitations that were omitted or inaccurately described in the search tools’ own frequently asked questions. At the time of our study, Google Scholar was researchers’ most widely used search engine. But we found that its ability to interpret Boolean search queries, such as ones involving OR and AND, was both inadequate and inadequately reported. On the basis of these findings, we recommended not relying on Google Scholar for the main search tasks in systematic reviews and meta-analyses (M. Gusenbauer & N. R. Haddaway Res. Synth. Methods 11, 181–217; 2020).
Even if search AIs are black boxes, their performance can still be evaluated using ‘metamorphic testing’, which varies a system’s inputs in controlled ways and checks whether its outputs change as expected. This is a bit like a car-crash test: it asks only whether and how passengers survive varying crash scenarios, without needing to know how the car works internally. Similarly, AI testing should prioritize assessing performance in specific tasks.
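As a sketch of what such a test could look like, the snippet below compares a query with a meaning-preserving rephrasing of it; a tool that behaves consistently should return largely the same top results for both. Again, the search callable and the example queries are illustrative assumptions, not a real tool’s API.

# Metamorphic test sketch: a meaning-preserving change to the input
# (here, a rephrased query) should not drastically change the output.
# `search` is a placeholder returning a ranked list of paper identifiers.
from typing import Callable, List

def metamorphic_overlap(search: Callable[[str], List[str]],
                        query: str, variant: str, top_k: int = 20) -> float:
    """Share of top-k results common to the original query and its variant."""
    original = set(search(query)[:top_k])
    rephrased = set(search(variant)[:top_k])
    return len(original & rephrased) / top_k

# Example usage (hypothetical tool and queries):
# metamorphic_overlap(my_tool, "exercise and depression in older adults",
#                     "depression and exercise in older adults")

A low overlap for queries that ask for the same literature would flag an unexpected sensitivity to wording, without any need to inspect the system’s internals.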
LLM creators should not be relied on to do these tests. Instead, third parties should conduct a systematic audit of these systems’ functionalities. Organizations that already synthesize evidence and advocate for evidence-based practices, such as Cochrane or the Campbell Collaboration, would be ideal candidates. They could conduct audits themselves or jointly with other entities. Third-party auditors might want to partner with librarians, who are likely to have an important role in teaching information literacy around AI-assisted search.
The aim of these independent audits would not be to decide whether or not LLMs should be used, but to offer clear, practical guidelines so that AI-assisted searches are used only for tasks of which they are capable. For example, an audit might find that a tool can be used for searches that help to define the scope of a project, but can’t reliably identify papers on the topic because of hallucination.
AI-assisted search systems must be tested before researchers inadvertently introduce biased results on a large scale. A clear understanding of what these systems can and cannot do can only improve scientific rigour.