Location: Crop Improvement and Genetics Research
Title: Assessing the performance of generative artificial intelligence in retrieving information against manually curated genetic and genomic data
Author
Poretsky, Elly - Oak Ridge Institute for Science and Education (ORISE)
Blake, Victoria - Montana State University
Andorf, Carson
Sen, Taner
Submitted to: Database: The Journal of Biological Databases and Curation
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 1/31/2025
Publication Date: 2/17/2025
Citation: Poretsky, E., Blake, V.C., Andorf, C.M., Sen, T.Z. 2025. Assessing the performance of generative artificial intelligence in retrieving information against manually curated genetic and genomic data. Database: The Journal of Biological Databases and Curation.
DOI: https://doi.org/10.1093/database/baaf011
Interpretive Summary: Curated resources at centralized repositories provide high-value service to users by enhancing data veracity. Curation, however, comes with a cost, as it requires dedicated time and effort from personnel with deep domain knowledge. We investigate the performance of a large language model, ChatGPT (Chat Generative Pre-Trained Transformer), in extracting and presenting data compared with a human curator. We used a small set of journal articles on wheat genetics, focusing on traits such as salinity tolerance and disease resistance. We developed a ChatGPT-based system and evaluated how ChatGPT performed in answering questions about traits and marker-trait associations. Our findings show that on average GPT-4 correctly categorized manuscripts 90% of the time, correctly extracted 82% of traits, and correctly extracted 63% of marker-trait associations.
Technical Abstract: Curated resources at centralized repositories provide high-value service to users by enhancing data veracity. Curation, however, comes with a cost, as it requires dedicated time and effort from personnel with deep domain knowledge. In this paper, we investigate the performance of a large language model (LLM), specifically generative pre-trained transformer (GPT)-3.5 and GPT-4, in extracting and presenting data compared with a human curator.
To accomplish this task, we used a small set of journal articles on wheat and barley genetics, focusing on traits such as salinity tolerance and disease resistance, which are becoming more important. The 36 papers were then curated by a professional curator for the GrainGenes database (https://wheat.pw.usda.gov). In parallel, we developed a GPT-based retrieval-augmented generation question-answering system and compared how GPT performed in answering questions about traits and quantitative trait loci (QTLs). Our findings show that on average GPT-4 correctly categorized manuscripts 97% of the time, correctly extracted 80% of traits, and correctly extracted 61% of marker–trait associations. Furthermore, we assessed the ability of a GPT-based DataFrame agent to filter and summarize curated wheat genetics data, showing the potential of human and computational curators working side by side. In one case study, GPT-4 was able to retrieve up to 91% of disease-related, human-curated QTLs across the whole genome, and up to 96% across a specific genomic region, through prompt engineering. We also observed that across most tasks GPT-4 consistently outperformed GPT-3.5 while generating fewer hallucinations, suggesting that improvements in LLMs will make generative artificial intelligence a much more accurate companion for curators in extracting information from scientific literature. Despite their limitations, LLMs demonstrated the potential to extract and present information to curators and users of biological databases, as long as users are aware of potential inaccuracies and the possibility of incomplete information extraction.