CALL FOR PARTICIPATION AT IberLEF 2025
PastReader 2025: IberLEF Task on Transcription of Historical Content (First Edition) - Transcribing Texts from the Past
Shared task website: https://sites.google.com/view/pastreader2025/home
Held as part of the evaluation forum IberLEF 2025 https://sites.google.com/view/iberlef-2025 in the XLI edition of the International Conference of the Spanish Society for Natural Language Processing (SEPLN 2025 https://eventos.ita.es/sepln_2025/inicio/)
September 23, 2025. Zaragoza, Spain
Dear All,
We are pleased to inform you that registration is now open for the task 'PastReader 2025: IberLEF Task on Transcription of Historical Content (First Edition) - Transcribing Texts from the Past'.
The PastReader task is held as part of IberLEF 2025, the shared evaluation campaign for Natural Language Processing systems in Spanish and other Iberian languages, co-located with the SEPLN 2025 conference.
This is a novel task focusing on the correction of text extracted from digitized historical documents. Participants in this task must be able to generate clean and corrected versions of texts extracted via OCR from the Spanish historical press. The corrected text should be faithful to the original, and take into account common errors derived from the digitization and OCR process. For this edition, the collection is based on the Hemeroteca Digital of the National Library of Spain (BNE).
- A dataset of digitized historical press from the BNE will be used.
- The collection contains millions of digitized pages of Spanish newspapers and magazines.
- The texts are in PDF format with OCR.
- The corpus includes publications from the 17th to the 20th century.
- The publications cover a wide variety of topics: politics, satire, humor, science, religion, illustration, entertainment, sports, art, and literature.
- The goal is to advance the automation of the transcription process.
TASK
Two tasks have been defined, mirroring the two basic steps of a transcription workflow, namely extracting text from scanned documents (OCR) and curating the extracted text to fix its errors:
- Task 1: Error correction. Participants will be provided with the output of an OCR system and will be asked to generate clean and corrected versions of the extracted texts.
- Task 2: End-to-end extraction. Given recent advances in multimodal systems, this task explores end-to-end approaches that take scanned pages as input and produce curated texts as output.
DATA
For this shared task, three subsets of data have been prepared:
- Training set: 8,959 pages (scanned PDF, OCR output, and corrected text).
- Development set: 500 pages (scanned PDF, OCR output, and corrected text).
- Test set:
  - Subtask 1: 2,736 pages (only the OCR output is released to participants).
  - Subtask 2: 2,736 pages (only the scanned PDF is released to participants).
The quality of the OCR results varies due to several factors, such as the date of digitization, available technology, the state of preservation of the originals, and the complexity of the text structure. Efforts have been made to improve these texts, including collaborative corrections through the ComunidadBNE platform. The manually corrected output serves as a valuable resource for testing and training technology.
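As a rough illustration of how the gap between an OCR output and its manually corrected version could be quantified, the sketch below computes a character error rate (CER) from the Levenshtein edit distance. This call does not state the official evaluation metric, so CER is only an assumption here, and the function names `edit_distance` and `char_error_rate` as well as the sample strings are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text: str, corrected_text: str) -> float:
    """Edit distance normalized by the length of the corrected text.
    Assumed metric for illustration only, not the official one."""
    if not corrected_text:
        return 0.0 if not ocr_text else 1.0
    return edit_distance(ocr_text, corrected_text) / len(corrected_text)


if __name__ == "__main__":
    # Made-up example of typical OCR confusions (I/l, fi/ñ).
    ocr = "Ia prensa espafiola"
    gold = "la prensa española"
    print(edit_distance(ocr, gold), round(char_error_rate(ocr, gold), 3))
```

Comparing the CER of the raw OCR output with the CER of a system's corrected output, both measured against the manually corrected text, gives a simple indication of whether the correction step helps.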
Participating in this task is a great opportunity to advance historical text processing. You'll work with a large dataset from the National Library of Spain (BNE), improving OCR correction skills and contributing to research. Your contribution will aid in digitizing historical documents for future access.
To participate, go to: https://forms.gle/iBwuUzjZdc2JyFDKA
IMPORTANT DATES
Feb 3rd: Registration open
Mar 17th: Release of training corpora
Mar 31st: Registration closed
Apr 7th: Release of test corpora and start of the evaluation campaign
Apr 14th: End of evaluation campaign (deadline for submission of runs)
Apr 18th: Publication of official results and release of test gold labels
May 12th: Deadline for paper submission
May 30th: Acceptance notification
Jun 16th: Camera-ready submission deadline
Jul 3rd: Final camera-ready submission deadline (to IberLEF organizers)
Sep, TBD: Publication of proceedings
Sep, TBD: IberLEF Workshop at SEPLN 2025
ORGANIZING COMMITTEE
- Arturo Montejo Ráez (Universidad de Jaén).
- Elena Sánchez Nogales (Biblioteca Nacional de España).
- Gloria Expósito Álvarez (Biblioteca Nacional de España).
- L. Alfonso Ureña López (Universidad de Jaén).
- María Teresa Martín Valdivia (Universidad de Jaén).
- Jaime Collado Montañez (Universidad de Jaén).
- Isabel Cabrera De Castro (Universidad de Jaén).
- María Victoria Cantero Romero (Universidad de Jaén).
- Ana García Serrano (UNED).
- Rocio Ortuño Casanova (UNED).
- Yanco Amor Torterolo Orta (UNED).
Best regards,
The PastReader 2025 organizing committee
Arturo Montejo Ráez
Associate Professor (Tenured), Universidad de Jaén
Departamento de Informática, A3-114, Las Lagunillas s/n, 23071 Jaén (Spain)
amontejo@ujaen.es | +34 953 212 882
ORCID: http://orcid.org/0000-0002-8643-2714 | Researcher ID: D-3387-2009
SINAI Research Group: https://sinai.ujaen.es