Nihar Shah Discusses Large Language Models and Peer Review Challenges in Science

Nihar Shah, a prominent artificial intelligence (AI) researcher and associate professor at Carnegie Mellon University, presented a seminar at the Center for Language and Speech Processing (CLSP) on October 10, titled “LLMs in Science, the good, the bad and the ugly.” The seminar examined the role of AI in scientific research and the peer review process.

Shah focused on the changing role, potential, and implications of large language models (LLMs) as they gain traction in research communities worldwide. He initiated the discussion by tackling a long-standing issue in academic research: the effectiveness of peer review.

In a study involving prestigious conferences like NeurIPS, AAAI, and ICML, Shah and his team assessed reviewers’ abilities to identify errors in submitted manuscripts. They introduced deliberate errors into three different papers: one contained a glaring mistake, another had a less obvious flaw, and the last featured a very subtle error. For the paper with the obvious error, Shah reported that of 79 reviewers, 54 did not comment on the erroneous sections, 19 believed the content was sound, and only one reviewer expressed concern, stating, “this looks really fishy.” These results highlight a significant weakness in the current peer review system, due largely to the pressure and time constraints reviewers face.

Shah also reflected on the importance of ethical scientific practices, noting that fraud has become increasingly prevalent in the peer review process. He pointed to collusion rings and the manipulation of paper assignments through selective bidding. Another issue is the self-selection of papers by reviewers when uploading to conference portals, leading to inaccurately represented expertise. Moreover, there have been extreme cases where individuals use fake email accounts, often linked to accredited institutions, to impersonate qualified reviewers.

To address these challenges, Shah proposed measures such as the implementation of trace logs, which provide a detailed, timestamped account of when reviewers accessed various components of a manuscript, the tools they utilized, and the comments they submitted. This approach aims to deter reviewers from bypassing sections of a paper and fabricating analyses they claim to have performed. Despite these safeguards, the peer review process remains vulnerable to human error and fatigue.
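To make the idea concrete, here is a minimal sketch of what such a trace log might record; the class and method names are hypothetical illustrations, not an actual system Shah described.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TraceLog:
    """Hypothetical timestamped log of a reviewer's interactions with a manuscript."""
    reviewer_id: str
    events: list = field(default_factory=list)

    def record(self, action: str, target: str) -> None:
        # e.g. action="viewed", target="Section 1"; action="commented", target="Section 3"
        self.events.append((datetime.now(timezone.utc), action, target))

    def sections_viewed(self) -> set:
        return {target for _, action, target in self.events if action == "viewed"}

# Usage: flag comments on sections the log never shows being opened.
log = TraceLog(reviewer_id="R42")
log.record("viewed", "Section 1")
log.record("commented", "Section 3")
unviewed_comments = {t for _, a, t in log.events if a == "commented"} - log.sections_viewed()
print(unviewed_comments)  # {'Section 3'}
```

The deterrent value comes from the mismatch check at the end: a review that discusses material the reviewer never opened is evidence of a fabricated analysis.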

Shifting from human to machine evaluation, Shah compared the efficacy of advanced LLMs like OpenAI’s GPT-4 to that of human reviewers. The LLMs consistently detected the most glaring flaws, “across many, many runs every single time.” However, identifying the less obvious error required rephrasing the prompt to direct the LLM’s attention to that specific part of the manuscript. Shah remarked, “When you specifically asked it to look at that, it said, ‘Oh, yeah, here’s the problem.’” This indicates that while LLMs can effectively pinpoint issues when guided, they cannot yet supplant human expertise.

In concluding his seminar, Shah delved into the emergence of AI scientists—systems capable of generating hypotheses, designing experiments, and writing research papers autonomously. “When you give it some broad direction, it can do all the research, including generating a paper,” Shah explained. He highlighted the vast potential of AI scientists to significantly expedite the discovery process by streamlining routine tasks that currently occupy human researchers.

However, Shah warned about the numerous challenges associated with AI scientists, including generating artificial data sets, engaging in p-hacking, and selectively reporting benchmarks. His team found instances where AI scientists selected and reported only the most favorable outcomes.
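A toy simulation, not drawn from Shah’s experiments, shows why reporting only the best outcome is misleading: even when a method is pure noise, the single best of many runs looks impressive.

```python
import random

random.seed(0)

def run_experiment() -> float:
    # Stand-in for a benchmark score; the true mean is 0.50 (pure noise).
    return random.gauss(mu=0.50, sigma=0.05)

scores = [run_experiment() for _ in range(20)]
honest = sum(scores) / len(scores)  # average over all runs
cherry_picked = max(scores)         # reporting only the single best run

print(f"honest mean: {honest:.3f}, best run: {cherry_picked:.3f}")
```

The gap between the honest mean and the cherry-picked maximum grows with the number of runs, which is exactly why selective reporting by an AI scientist (or a human one) inflates apparent performance.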

Shah concluded with a vital takeaway: LLMs offer tremendous opportunities but also present challenges. He urged the audience to approach the adoption of AI with a critical eye, particularly concerning the integrity of scientific research.