[Corpora-List] Publication of the LASLA and LASLA CoNLL-U corpus

2 Oct 2023


Dear all,
We are happy to announce that the LASLA Latin corpus has been published Open Access under a CC-BY-NC-SA 4.0 license. The published portion of the LASLA corpus comprises ca. 1.7 million tokens from works of the Classical period, manually annotated with the following information: lemma, part of speech, morphological features, partial syntactic information, and metadata. The LASLA has ongoing annotation projects, whose results will be uploaded to the Dataverses when they are finalised. We hope to provide a service to the community working on Latin linguistics and Latin literary studies, and to support the most recent trends in NLP.
The corpus can be accessed in three Dataverses, each containing one specific format. We recommend using the “Tree View” to have an idea of what files can be found in the Dataverse.
  *   DAT and APN files (https://doi.org/10.58119/ULG/27VZID and https://doi.org/10.58119/ULG/QJJ0SA, respectively) are published with detailed documentation on the codes used and on all the annotation choices implemented by the LASLA across the years. We hope that this documentation will help external researchers make the best use of the data.
  *   BPN files (https://doi.org/10.58119/ULG/49UQNU) were previously shared with external partners under Data Transfer Agreements. Beyond documentation purposes, this Dataverse also provides the original version on which the CoNLL-U format was based (see below).
The LASLA files can be exploited via two free online interfaces:
  *   Opera Latina (http://cipl93.philo.ulg.ac.be/OperaLatina/), which enables structured searches through the files. An account can be requested by contacting Lauren Simon (L.Simon@uliege.be).
  *   HyperbaseWeb (http://hyperbase.unice.fr/hyperbase/, Latin bases), which allows complex statistical queries to be carried out and does not require an account. Documentation can be found here (https://margheritafantoli.wordpress.com/2021/04/22/having-fun-with-hyperbaseweb-and-the-english-royal-family/) and here (https://margheritafantoli.wordpress.com/2021/04/22/having-fun-with-hyperbaseweb-and-the-english-royal-family-ii/).
Following the Data Transfer Agreement for BPNs, an intense collaboration with the LiLa ERC project (https://lila-erc.eu/) started. The output of this collaboration is the following:
  *   The LASLA corpus is linked to the LiLa Knowledge Base and can be queried, jointly with all the other linked resources, via the LiLa Interactive Search Platform (https://lila-erc.eu/LiLaLisp/) and the SPARQL endpoint (https://lila-erc.eu/sparql/). The triples of the linking are published openly here.
  *   The LiLa team has converted the BPN files into CoNLL-U files, enriching the annotation with the URIs of tokens and lemmas as they are found in the LiLa Knowledge Base. This version of the corpus can be found on Zenodo (https://doi.org/10.5281/zenodo.5961377) and GitHub (https://github.com/CIRCSE/LASLA).
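For readers who have not worked with CoNLL-U before, here is a minimal sketch of how such a file could be read with the Python standard library alone. It relies only on the general CoNLL-U layout (ten tab-separated columns per token, blank lines between sentences, comment lines starting with "#"). The two-token sample and the MISC key "ExampleURI" are invented for illustration and are not taken from the LASLA files; please check the Zenodo/GitHub release for the actual column contents and for where the LiLa URIs are stored.

```python
# Minimal CoNLL-U reader, standard library only.
# Assumption: the input follows the generic CoNLL-U layout (10 tab-separated columns).

CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_key_values(field):
    """Turn 'A=1|B=2' into {'A': '1', 'B': '2'}; '_' means the field is empty."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|") if "=" in item)


def read_conllu(text):
    """Return a list of sentences, each a list of token dictionaries."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes the current sentence
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):      # sentence-level comment / metadata line
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError("expected 10 columns, got %d" % len(columns))
        token = dict(zip(CONLLU_FIELDS, columns))
        token["feats"] = parse_key_values(token["feats"])
        token["misc"] = parse_key_values(token["misc"])
        tokens.append(token)
    if tokens:                        # file may not end with a blank line
        sentences.append(tokens)
    return sentences


# Invented sample for illustration only; "ExampleURI" is a hypothetical MISC key.
SAMPLE = "\n".join([
    "# text = Gallia est",
    "\t".join(["1", "Gallia", "Gallia", "PROPN", "_", "Case=Nom|Number=Sing",
               "_", "_", "_", "ExampleURI=http://example.org/lemma/1"]),
    "\t".join(["2", "est", "sum", "AUX", "_", "Number=Sing|Person=3",
               "_", "_", "_", "ExampleURI=http://example.org/lemma/2"]),
    "",
])

if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE):
        for token in sentence:
            print(token["form"], token["lemma"], token["upos"], token["misc"])
```

To use it on a downloaded file, pass the file contents to read_conllu, for example read_conllu(open(path, encoding="utf-8").read()), where path points to one of the .conllu files.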
We hope that this collaboration will trigger many others, with other partners enriching and providing new exploitation pathways for the LASLA corpus.
For the moment, have fun!
With kind regards,
The LASLA and LiLa teams
Prof. Marco C. Passarotti
Computational Linguistics
Index Thomisticus Treebank https://itreebank.marginalia.it/
ERC Grantee, P.I. LiLa https://lila-erc.eu/ (Grant Agreement No. 769994)
CIRCSE Research Centre https://centridiricerca.unicatt.it/circse_index.html