Dear All,
I'm wondering if there is a more recent version of the "European Parliament Proceedings Parallel Corpus 1996-2011" (https://www.statmt.org/europarl/). Alternatively, any experience downloading the EP proceedings using https://data.europarl.europa.eu/en/developer-corner/opendata-api would be welcome.
Thanks!
Alexandr
Dear Alexandr,
As described below, the DCEP corpus is not a more recent version of Europarl. It contains documents produced between 2001 and 2012 and excludes those already in the Europarl corpus, so the two do not overlap.
Please find more details below:
The Digital Corpus of the European Parliament (DCEP) contains the majority of the documents published on the European Parliament's official website. It comprises a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies.
The current version covers a wide range of subject domains. With a total of 1.37 billion words in 23 languages (253 language pairs), gathered over ten years, this is the largest single release of documents by a European Union institution. It includes documents produced between 2001 and 2012, excluding only those already in the Europarl corpus, to avoid overlap.
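As a quick sanity check on the figure above, 253 is the number of unordered pairs that can be formed from 23 languages. The short Python sketch below reproduces it. The language codes are hypothetical placeholders, since the actual DCEP language list is not given in this message.

```python
from itertools import combinations
from math import comb

# Hypothetical placeholder codes; the real DCEP language list is not stated here.
languages = [f"lang{i:02d}" for i in range(1, 24)]

# Unordered pairs: (A, B) and (B, A) are counted once.
pairs = list(combinations(languages, 2))

# Prints the number of languages, the enumerated pair count,
# and the closed-form value C(23, 2) = 23 * 22 / 2.
print(len(languages), len(pairs), comb(23, 2))
```

If translation direction matters (A to B counted separately from B to A), the count doubles to 506.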
To download the corpus and for more information, see: https://ec.europa.eu/jrc/en/language-technologies/dcep
For a more detailed description of DCEP and when making reference to DCEP in scientific publications, please refer to:
Hajlaoui Najeh, Kolovratnik David, Väyrynen Jaakko, Steinberger Ralf, and Varga Dániel (2014). DCEP - Digital Corpus of the European Parliament. Proc. LREC 2014 (Language Resources and Evaluation Conference). Reykjavik, Iceland. May 26-31, 2014. pp. 3164-3171 (URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/943_Paper.pdf).
Best regards,
Najeh
On Tue, 25 Jul 2023 at 12:43, alexandr.rosen--- via Corpora <corpora@list.elra.info> wrote:
Dear All,
I'm wondering if there is a more recent version of the "European Parliament Proceedings Parallel Corpus 1996-2011" ( https://www.statmt.org/europarl/). Alternatively, any experience downloading the EP proceedings using https://data.europarl.europa.eu/en/developer-corner/opendata-api would be welcome.
Thanks!
Alexandr
Dear Najeh,
Thanks a lot. At first I didn't realize that the DCEP corpus includes so many documents beyond those in the Europarl corpus, so I'll definitely consider using it. However, I'm more interested in more recent EP texts, preferably proceedings, mainly because of the newly added languages (such as Croatian) and also the newer topics and vocabulary.
Best,
Alexandr