Information Extraction from Chinese Wheat Varieties Journal Based on Large Language Model

Published in Frontiers of Data & Computing, 2025

[Objective] To help turn wheat germplasm resources into advantages for the wheat industry and to broaden the genetic background of wheat, this paper uses a large language model (LLM) and prompt engineering to mine information from the three published volumes of the Chinese Wheat Variety Journal. [Methods] The paper volumes were scanned and processed with OCR and other data-processing steps to obtain wheat variety data. Key extraction indicators were defined for the variety data according to the needs of breeding work, together with the corresponding LLM prompts, and the key information was then extracted automatically by calling commercial LLM APIs. Together these steps form a mature workflow for LLM-based extraction of wheat variety information. [Results] Precision, recall, and F1 were calculated from the number of relations actually present, the number of relations recognized, and the number of relations correctly recognized in the information extraction task. Across the three published volumes of the Chinese Wheat Variety Journal, the extraction scheme achieved precision above 0.89, recall above 0.73, and F1 above 0.84. [Conclusions] The high precision indicates that the scheme is capable of precise information extraction, while the lower recall indicates that some information goes unrecognized. Judged by the combined F1 score the scheme is feasible overall, but the extraction results still require manual verification and review.
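The three metrics named in the Results follow the standard definitions: precision is the share of recognized relations that are correct, recall is the share of actually existing relations that were correctly recognized, and F1 is their harmonic mean. The sketch below is a minimal illustration of that calculation, not the paper's own code; the function name `extraction_metrics` and the counts passed to it are hypothetical.

```python
def extraction_metrics(n_actual: int, n_recognized: int, n_correct: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from relation counts.

    n_actual:     relations actually present in the source text
    n_recognized: relations the extraction scheme returned
    n_correct:    returned relations that are correct
    """
    if min(n_actual, n_recognized, n_correct) < 0:
        raise ValueError("counts must be non-negative")
    if n_correct > min(n_actual, n_recognized):
        raise ValueError("n_correct cannot exceed n_actual or n_recognized")

    # Guard against division by zero when nothing exists or nothing was returned.
    precision = n_correct / n_recognized if n_recognized else 0.0
    recall = n_correct / n_actual if n_actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only; these are not figures from the paper.
p, r, f1 = extraction_metrics(n_actual=100, n_recognized=80, n_correct=72)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why missed information (low recall) limits the combined score even when precision is high.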

Recommended citation: Wei Yijin, Chen Yanqing, Wang Xiudong & Fan Jingchao. (2025). Information Extraction from Records of Chinese Wheat Varieties Based on Large Language Models. Frontiers of Data and Computational Development (Chinese and English), 7(01), 175–185.


韦一金 Yijin Wei
