Chat-rgie: precision extraction of rice germplasm data using large language models and prompt engineering

Published in Journal of Big Data, 2025

Varietal improvement is a key aspect of breeding, and as a result of this work, crop varietal data becomes more complicated, requiring more resources to extract. As a result, we developed Chat-RGIE, a rice germplasm data extraction strategy based on conversational large language models (LLM) and cue word engineering, to achieve rice germplasm data extraction in a ZERO-shot manner. The technique employs multi-response voting to limit the chance of phantom appearances, as well as an additional calibration component to choose the best data extraction findings. We performed performance evaluation and real-life data extraction evaluation on Chat-RGIE, and the scheme obtained 0.9102 precision, 0.9941 recall, and 0.9554 accuracy in performance evaluation, and 0.6351 precision, 1.0 recall, and 0.8225 accuracy in real-life data extraction evaluation, which completely proved the effectiveness of the scheme. Furthermore, the well-designed data extraction procedure mitigates the likelihood of potential bias from a single large model leading to hallucinations to some extent, with the incidence of hallucinations in the two evaluations being 0.0015 and 0.005, respectively, with a very minor influence. Furthermore, we employed Restraint Rate, a statistic used to quantify the degree of limits placed by the prompt on LLM replies, with values of 0.9265 and 0.911 in the two evaluations, resulting in normative responses. Furthermore, when we examined the data extraction results, we discovered that when confronted with an unanswerable answer, the LLM is affected by the stress provided by the prompt, and the higher the stress, the more likely it is to engage in constraint-violating behavior, which is similar to what humans do when stressed. We therefore believe that some of the countermeasures in the human behavior in question also have the potential to help improve LLM performance.

Recommended citation: Wei Yijin, Fan Jingchao. (2025). Chat-rgie: precision extraction of rice germplasm data using large language models and prompt engineering. Journal of Big Data, 12(1): 202.
Download Paper