Dear all,

The CLASSLA Knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) is delighted to announce the release of comparable web corpora for all official South Slavic languages:

- Slovenian (1.8 billion words): https://www.clarin.si/ske/#concordance?corpname=classlaweb_sl
- Croatian (2.2 billion words): https://www.clarin.si/ske/#concordance?corpname=classlaweb_hr
- Bosnian (802 million words): https://www.clarin.si/ske/#concordance?corpname=classlaweb_bs
- Montenegrin (151 million words): https://www.clarin.si/ske/#concordance?corpname=classlaweb_cnr
- Serbian (2.3 billion words): https://www.clarin.si/ske/#concordance?corpname=classlaweb_sr
- Macedonian (479 million words): https://www.clarin.si/ske/#concordance?corpname=classlaweb_mk
- Bulgarian (3.3 billion words): https://www.clarin.si/ske/#concordance?corpname=classlaweb_bg

Together, these corpora amount to almost 11 billion words!

The linguistic annotation was performed with the state-of-the-art CLASSLA-Stanza toolkit (https://pypi.org/project/classla/), which you can now also try out through the CLASSLA annotator web interface (https://clarin.si/oznacevalnik/eng).

Additionally, each of the 26 million documents in these seven corpora is annotated with the X-GENRE multilingual genre classifier (https://huggingface.co/classla/xlm-roberta-base-multilingual-text-genre-classifier), enabling the creation of subcorpora based on genre information. Interestingly, although all the corpora were built with the same pipeline, the genre distribution varies considerably from corpus to corpus.

If you are interested in more details on the sizes, genre distributions and additional insights, we warmly invite you to read our blog post: https://www.clarin.si/info/k-centre/comparable-classla-web-corpora-of-south-slavic-languages/

Best regards,
Nikola Ljubešić, Taja Kuzman and many other CLASSLAers