Dear all,
If you are involved in (web) corpus creation and curation, interested in large multilingual corpora for European languages, or working with automatic genre annotation, the following resources may be useful to you. Several multilingual genre-related resources and technologies are now available in the CLARIN.SI and Hugging Face repositories:
- Genre-enriched MaCoCu-Genre web corpora - MaCoCu web corpora for 13 European languages (Albanian, Bosnian, Bulgarian, Catalan, Croatian, Greek, Icelandic, Macedonian, Montenegrin, Serbian, Slovenian, Turkish, and Ukrainian), automatically annotated with genre labels. In total, the collection comprises 67 million texts and 28.5 billion words. The corpora are available in the CLARIN.SI repository: http://hdl.handle.net/11356/1969
- X-GENRE classifier - multilingual text genre classifier, applicable to any of the 100 languages that are included in the XLM-RoBERTa model - available on Hugging Face (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-clas...) and CLARIN.SI repository (http://hdl.handle.net/11356/1961)
- English-Slovenian X-GENRE dataset - manually annotated genre dataset, used for training and evaluating the X-GENRE classifier - available on Hugging Face (https://huggingface.co/datasets/TajaKuzman/X-GENRE-text-genre-dataset) and CLARIN.SI repository (http://hdl.handle.net/11356/1960).
Additionally, we have set up a benchmark for automatic genre identification (https://github.com/TajaKuzman/AGILE-Automatic-Genre-Identification-Benchmark) for continuous evaluation of emerging technologies on this task. The benchmark is based on unpublished, manually annotated datasets - if you wish to test your own systems on the task, let me know, and we'll be happy to share the datasets with you.
Best regards,