The AI Scientist
Large language models, the AI algorithms taking the world by storm, are galvanizing academic research.
These algorithms scrape data from the digital world, learn patterns in it, and use those patterns to complete a variety of specialized tasks. Some are already aiding research scientists. A few can solve challenging math problems; others are “dreaming up” new proteins to tackle some of our worst health problems, including Alzheimer’s and cancer.
Although helpful, these tools assist only in the later stages of research, when scientists already have ideas in mind. What about having an AI come up with a new idea in the first place?
AI can already help draft scientific articles, generate code, and search the scientific literature. These steps resemble the early phase of research, when scientists gather knowledge and begin forming ideas based on what they’ve learned.
Some of these ideas are highly creative, in the sense that they could lead to out-of-the-box theories and applications. But creativity is subjective. One way to gauge the potential impact of research ideas, among other factors, is to call in human judges who are blinded to where each idea came from.
“The best way for us to contextualize such capabilities is to have a head-to-head comparison” between AI and human experts, study author Chenglei Si told Nature.
The team recruited over 100 computer scientists with expertise in natural language processing to come up with ideas, act as judges, or both. These experts are especially well-versed in how computers can communicate with people using everyday language. The team pitted 49 participants against a state-of-the-art LLM based on Anthropic’s Claude 3.5. The scientists earned $300 per idea plus an additional $1,000 if their idea scored in the top 5 overall.
The Human Critic
To make it a fair test, the judges didn’t know which responses came from AI. To disguise the submissions, the team ran both the human and AI ideas through another LLM that rewrote them in a uniform, generic tone. The judges then evaluated each idea for novelty, excitement, and, most importantly, feasibility: whether it could actually work.
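In practice, that kind of style normalization can be a single extra LLM call per submission. The study’s exact prompt and model aren’t reproduced here, so the sketch below is purely illustrative: the OpenAI client, the “gpt-4o” model name, and the prompt wording are all assumptions.

```python
# Illustrative sketch: rewrite a submission in a neutral style so judges
# can't tell whether it came from a human or an AI. Assumes the OpenAI
# Python client; the model name and prompt are hypothetical, not the study's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

STYLE_PROMPT = (
    "Rewrite the following research idea in a neutral, uniform style. "
    "Preserve every technical detail; change only the tone and phrasing.\n\n{idea}"
)

def normalize_style(idea: str) -> str:
    """Return a style-normalized version of one idea submission."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model would do
        messages=[{"role": "user", "content": STYLE_PROMPT.format(idea=idea)}],
    )
    return response.choices[0].message.content
```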
After aggregating the reviews, the team found that, on average, ideas generated by human experts were rated less exciting than those from the AI, but more feasible. As the AI generated more ideas, however, they became less novel, and the model increasingly produced duplicates. Digging through the AI’s nearly 4,000 ideas, the team found only around 200 unique ones that warranted more exploration.
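Winnowing thousands of generated ideas down to the unique ones is typically done by embedding each idea as a vector and discarding any that sit too close to an idea already kept. Below is a minimal sketch of that greedy deduplication using the sentence-transformers library; the encoder choice and the 0.8 cosine-similarity cutoff are illustrative assumptions, not necessarily the study’s exact settings.

```python
# Illustrative sketch: embedding-based deduplication of generated ideas.
# The encoder model and similarity threshold are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def deduplicate(ideas: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep an idea only if no already-kept idea is too similar."""
    embeddings = model.encode(ideas, convert_to_tensor=True, normalize_embeddings=True)
    kept, kept_vecs = [], []
    for idea, vec in zip(ideas, embeddings):
        # Compare against every idea kept so far via cosine similarity
        if all(float(util.cos_sim(vec, kv)) < threshold for kv in kept_vecs):
            kept.append(idea)
            kept_vecs.append(vec)
    return kept

# e.g., ~4,000 generated ideas in, ~200 unique ideas out:
# unique_ideas = deduplicate(all_generated_ideas)
```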
But many weren’t reliable. Part of the problem stems from the fact that the AI made unrealistic assumptions. It hallucinated ideas that were “ungrounded and independent of the data” it was trained on, wrote the authors. The LLM generated ideas that sounded new and exciting but weren’t necessarily practical for AI research, often because of latency or hardware constraints.
“Our results indeed indicated some feasibility trade-offs of AI ideas,” wrote the team.