Dear editor, I am Bin Li, one of the organizers of EvaHan2026. Would you spread this CFP to the corpora list? Thank you so much!
--
Best wishes!
Bin Li
Phone: (86)13813878144
Homepage: http://cognitivebase.com/lib/
School of Chinese Language and Literature, Nanjing Normal University, China
CFP | EvaHan 2026 Ancient Chinese OCR Shared Tasks

EvaHan 2026
https://github.com/GoThereGit/EvaHan

EvaHan 2026 is the Fifth International Evaluation of Ancient Chinese Information Processing, focusing on OCR tasks for multimodal large language models in ancient Chinese. It is co-organized with LT4HALA 2026@LREC 2026, which will be held from May 11 to 16, 2026, in Mallorca, Spain. EvaHan 2026 is organized by Dongbo Wang, Bin Li, Minxuan Feng, Chao Xu, Weiguang Qu, Liu Liu, and Si Shen.

Previous Tasks:

EvaHan 2022: The First Bake-off of Ancient Chinese Automatic Processing was held in Marseille, France, in 2022, with a focus on automatic word segmentation and part-of-speech tagging of ancient Chinese.
EvaHan 2023: The Second Bake-off of Ancient Chinese Automatic Processing was held in Macau, China, in 2023, with a focus on machine translation of ancient Chinese.
EvaHan 2024: The Third Bake-off of Ancient Chinese Automatic Processing was held in Turin, Italy, in 2024, with a focus on automatic sentence segmentation and punctuation of ancient Chinese.
EvaHan 2025: The Fourth Bake-off of Ancient Chinese Automatic Processing was held in New Mexico, USA, in 2025, with a focus on named entity recognition in ancient Chinese.

Important Dates for EvaHan 2026:

Registration deadline: January 30, 2026
Training data release: January 1, 2026
Test data release: February 1, 2026
Running results submission: February 6, 2026
Technical report submission deadline: February 28, 2026
Notification of acceptance: March 1, 2026
Camera-ready papers due: March 10, 2026

Participation

To participate in EvaHan 2026, you must complete the following steps:

Registration: Submit a registration form to officially register your team for the task. Registration is open from December 1, 2025, to January 30, 2026. Only registered participants will gain access to the training dataset.
Accessing the Training Data: After completing the registration process, participants will receive instructions for downloading the training dataset, which includes image-text pairs from ancient Chinese texts for OCR.

Submitting Results and Reports: Participants must use the provided test data to generate results and submit their system outputs and a technical report according to the shared task schedule.

For inquiries or to request the registration form, please contact us at evahan2026@gmail.com.

Data

The EvaHan 2026 data comprises three datasets of image-text pairs: plain-text images, mixed image-text images, and handwritten text images. The data underwent initial automatic annotation, followed by correction and refinement by experts in classical Chinese language and history to ensure the quality of the training materials and gold-standard texts.

● Dataset A (Printed Texts) consists of data selected from the Siku Quanshu (Complete Library of the Four Treasuries), including classics, history, philosophy, and literature, as well as various other ancient books.
● Dataset B (Mixed Layouts) contains mixed image-text data selected from the Siku Quanshu and other ancient books.
● Dataset C (Handwritten Texts) includes handwritten ancient books, primarily from the Chinese Buddhist canon, including the Chinese Buddhist canon (TKH) dataset and the Chinese Buddhist canon (MTH) dataset.

Training Data

The training set consists of designated portions of subsets A, B, and C. All training samples are provided as image-text pairs, with text in Traditional Chinese (UTF-8), at approximately 5,000-10,000 image-text pairs per subset. Registered participants will receive the training data via email.

Test Data

The test set includes the remaining unseen portions of subsets A, B, and C to ensure comprehensive evaluation of all three challenge types. The data is also provided as image-text pairs, at approximately 200-500 image-text pairs per subset.
Detailed information and a download link for the test data will be provided to participants before the start of the formal evaluation period.

Task

This section describes the tasks in EvaHan 2026.

OCR

In many Chinese language processing systems, OCR is a critical task, often performed in parallel with other processing functions. The accuracy and speed of OCR directly determine the overall system's performance and user experience in downstream applications such as document digitization, information extraction, and intelligent retrieval.

Evaluation Metrics

Each team will initially have access only to the training data. Later, unlabeled test data will be released. After the evaluation is complete, the labels for the test data will also be released. Tables 2, 3, and 4 provide examples of the scorer output.

The evaluation will first align the system-generated text with the gold standard. OCR output will then be scored with precision, recall, and F1. BLEU, ROUGE-1, ROUGE-2, and ROUGE-L will also be computed, so the evaluation covers multiple metrics. This edition also adds layout analysis metrics: mAP and IoU. The final ranking of teams will be based on the combined overall score.

Two Modalities

Each participant can submit results for both modalities.

In the closed modality, resources are limited: each team can only use the training data and a pre-trained model. This model is a word embedding pre-trained on a large Traditional Chinese corpus. No other resources are allowed in the closed modality.

In the open modality, there are no restrictions on resources, data, or models. Annotated external data, such as processed images or text, may be used. However, each team must disclose all resources, data, and models used in each system in the final report.

How to Participate

The registration period is given above.
Participants will be required to submit their runs and to provide a technical report for the task they participated in.

Submitting Runs

Each team can submit two runs. The first run should be produced according to the closed modality; the second according to the open modality. The closed run is compulsory, while the open run is optional.

Once the system has produced the results for the task over the test set, participants have to follow these instructions to complete their submission:

The annotated results should be submitted as three plain text files encoded in UTF-8 (including four-byte characters). The specific submission format will be released along with the pre-trained dataset.

Organizers

Dongbo Wang, College of Information Management, Nanjing Agricultural University, China
Bin Li, School of Chinese Language and Literature, Nanjing Normal University, China
Minxuan Feng, School of Chinese Language and Literature, Nanjing Normal University, China
Chao Xu, School of Chinese Language and Literature, Nanjing Normal University, China
Weiguang Qu, School of Computer and Electronic Information / School of Artificial Intelligence, Nanjing Normal University, China
Liu Liu, College of Information Management, Nanjing Agricultural University, China
Si Shen, School of Economics and Management, Nanjing University of Science and Technology, China

Student Members

Dongmei Zhu, College of Information Management, Nanjing Agricultural University, China
Jieqiong Li, College of Information Management, Nanjing Agricultural University, China
Ruifeng Wu, College of Information Management, Nanjing Agricultural University, China
Junyi Yang, College of Information Management, Nanjing Agricultural University, China
Zhixing Xu, School of Chinese Language and Literature, Nanjing Normal University, China
Junjie Li, School of Chinese Language and Literature, Nanjing Normal University, China
Yue Zhu, School of Chinese Language and Literature, Nanjing Normal University, China
Mengting Xu, School of Chinese Language and Literature, Nanjing Normal University, China
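
For teams who want a rough local sanity check before submitting, the character-level precision, recall, and F1 mentioned under Evaluation Metrics, and the IoU used for layout analysis, could be approximated along the following lines. This is only a minimal sketch under stated assumptions, not the official scorer: the function names char_prf and box_iou are hypothetical, the alignment uses Python's difflib.SequenceMatcher rather than whatever aligner the organizers use, and layout boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
from difflib import SequenceMatcher


def char_prf(gold, pred):
    """Character-level precision, recall, and F1 (hypothetical helper).

    Assumption: a character counts as correct when difflib's alignment
    places it in a matching block; the official scorer may align differently.
    """
    matcher = SequenceMatcher(None, gold, pred, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    Assumption: x1 < x2 and y1 < y2 for both boxes.
    """
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


if __name__ == "__main__":
    # One substituted character out of seven in the system output.
    print(char_prf("子曰學而時習之", "子日學而時習之"))
    # Two 2x2 boxes that overlap in a 1x1 square.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Scores from such a sketch are only indicative; the scorer released by the organizers remains the reference for ranking.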