Sharing some model testing work that others might find interesting. I did this work for someone else, which is why I'm only sharing notes rather than full prompts and outputs.

I tested Google's Gemini Pro 3 Thinking and Gemini Pro 3 with Deep Research on a fairly straightforward research task: analyze Catalyst grant award winners. The prompt asked the model to create a list of winners and their locations, check which became Digital Science portfolio companies, note which logos appear in a graphic and a video, and output the results as JSON. This was meant to be a realistic test of what these models can currently do, using the kind of prompt a typical user would write.
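
For context, each record in the requested JSON looked roughly like the sketch below. The field names and the "unknown" placeholder are my own illustration of the kind of structure the prompt asked for, not the exact wording.

```python
# Hypothetical per-company record for the requested JSON output.
# Field names and the "unknown" placeholder are illustrative only,
# not the exact wording used in the prompt.
example_record = {
    "company": "Example Co",
    "award_year": 2019,
    "location": "United Kingdom",
    "digital_science_portfolio": False,  # did it become a portfolio company?
    "logo_in_2024_report": "unknown",    # explicit value when information is unknown
    "logo_in_video": "unknown",
}
```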

Gemini Pro 3 Thinking

I ran multiple prompts with slight variations. Some responses returned "I'm having a hard time fulfilling your request." The worst returned nothing but the YouTube link from the prompt. The best produced a summary of the video contents, confidently listing ORCID, Cureus, Research Square, and iThenticate as part of the Digital Science portfolio (none of which are actually Digital Science companies).

Gemini Pro 3 with Deep Research

Every run produced a ~2,000-word essay plus a chunk of JSON. The reports were around 90-95% accurate on basic facts, but the errors were often glaring - for example, Newsflo isn't a Digital Science portfolio company; it was acquired by Elsevier in 2015. Overstatement was also common; for example:

"While the early years of the grant were dominated by the United States and the United Kingdom, the later years show a significant diversification in the origin of innovation. This global shift mirrors the broader trends identified in Digital Science's "State of Open Data" reports, which note that developing nations are increasingly becoming leaders in research into the United Nations Sustainable Development Goals (SDGs)."

Does this statement match the data? Not really. From 2011 to 2016, USA- and UK-based companies dominated almost every year. From 2017 onwards, there were winners from Japan, Hong Kong, India, Belarus, Russia, Sweden, Denmark, and Australia, but the USA still dominated most years, including 2023 and 2025. Does the situation mirror developing nations becoming leaders? Not really - India and Belarus are represented in the data, but no African or South American countries are. Referring to Digital Science's State of Open Data report is reasonable at a thematic level, but the Catalyst grants aren't related to the SDGs, so the reference feels a bit random. The model has taken the available information and woven it into a coherent story that sounds plausible but doesn't hold up against the data.

Another example:

"The longevity of the program and the high rate of integration into the permanent portfolio suggest that the Catalyst Grant remains one of the most effective mechanisms for translating academic innovation into sustainable global research infrastructure."

This overstates the evidence. The grant program has run from 2011 to 2025 with brief pauses, so yes, it's long-running. A small minority of winners, such as Writefull, Ripeta, and Penelope, have become portfolio companies, and another small minority, such as Authorea, TetraScience, and Newsflo, have successfully exited outside Digital Science. But "one of the most effective mechanisms for translating academic innovation into sustainable global research infrastructure"? That final claim is doing far more work than the evidence supports.

Logo matching tasks

There were two logo matching tasks in the prompt: note which Catalyst grant company logos appear in the 2024 company report, and note which appear in the video. The image matching results were typically fine. The video matching was very poor - maybe 10% correct on a good run. The models can't reliably distinguish between the Altmetric and Dimensions logos.

JSON outputs

Despite requesting that the model output JSON only, I always got an essay plus JSON. The JSON I did get was poorly formatted and invalid. The prompt specified what to put into the JSON file if information was unknown - it was hit and miss (largely miss) whether this instruction was followed.
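
Because every response was an essay with JSON buried somewhere inside it, the practical workaround was to pull the JSON out and validate it before doing anything with it. A minimal sketch of that step in Python, assuming the JSON sits in a fenced code block or a bare brace/bracket-delimited span somewhere in the response text:

```python
import json
import re

def extract_json(response_text: str):
    """Pull the first parseable JSON object or array out of an essay-plus-JSON response.

    Assumes the JSON sits inside a triple-backtick code fence or in a bare
    brace/bracket-delimited span; returns None if nothing parses.
    """
    candidates = []
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", response_text, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    # Fall back to the widest bracket- or brace-delimited span.
    for open_ch, close_ch in (("[", "]"), ("{", "}")):
        start, end = response_text.find(open_ch), response_text.rfind(close_ch)
        if start != -1 and end > start:
            candidates.append(response_text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

fence = "`" * 3
sample = f'Some essay text...\n{fence}json\n[{{"company": "Example Co", "award_year": 2019}}]\n{fence}'
print(extract_json(sample))  # -> [{'company': 'Example Co', 'award_year': 2019}]
```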

Data accuracy varied significantly by run, ranging from ~25% to ~65%. The problems were typically down to laziness rather than hallucination (tightening up the prompt language and adding instructions on how to handle messy data improved the quality of the output). Responses also typically covered only part of the task: all of the companies but half of the associated data points, or half the companies and most of the associated data points.
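
To put rough numbers on that, I scored each run against a hand-checked reference list. A sketch of that kind of scoring, assuming records shaped like the hypothetical example earlier (the field names are still my own illustration):

```python
def coverage_scores(records, reference_companies, required_fields):
    """Rough completeness scores for one run's JSON output.

    Returns (company_coverage, field_coverage): the fraction of reference
    companies that appear at all, and the fraction of required fields that
    are actually filled in (not missing, empty, or "unknown").
    """
    returned = {r.get("company") for r in records}
    company_coverage = len(returned & set(reference_companies)) / len(reference_companies)
    filled = sum(
        1
        for r in records
        for field in required_fields
        if r.get(field) not in (None, "", "unknown")
    )
    field_coverage = filled / (len(records) * len(required_fields)) if records else 0.0
    return company_coverage, field_coverage

# Example: two of three known winners returned, one with its location left unknown.
run = [
    {"company": "Example Co", "award_year": 2019, "location": "United Kingdom"},
    {"company": "Other Co", "award_year": 2021, "location": "unknown"},
]
print(coverage_scores(run, ["Example Co", "Other Co", "Third Co"],
                      ["award_year", "location"]))  # -> (0.666..., 0.75)
```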

How useful were the outputs?

If you want a JSON output with accurate data to base an analysis on, these results are poor, and the resulting analysis would be equally poor. If you want background information about Catalyst grant award winners, their geographic locations, and their current status within Digital Science, the report is useful. It ties the story up into a positive, unifying narrative that gives you the big picture, but some of the finer details are wrong.

Worth noting that in 12 months, this will probably work fine. The gap between current capability and actual reliability on structured tasks is closing fast. Running the analysis in Notebook and specifying all of the data sources generates a much more accurate dataset.
