Dear Corpora subscribers,
I'm pleased to announce the availability of two new corpora of automatic speech recognition transcripts from the YouTube channels of municipalities and other local government entities:
* The Corpus of Australian and New Zealand Spoken English (CoANZSE: https://cc.oulu.fi/~scoats/CoANZSE.html), a 196-million-word corpus of 57k transcripts from 482 YouTube channels, corresponding to 24k hours of video.
* The Corpus of German Speech (CoGS: https://cc.oulu.fi/~scoats/CoGS.html), a 51-million-word corpus of 39k transcripts from 1.3k YouTube channels, corresponding to 7.2k hours of video.
The corpora were created using methods similar to those used to create the Corpus of North American Spoken English (https://cc.oulu.fi/~scoats/CoNASE.html) and the Corpus of British Isles Spoken English (https://cc.oulu.fi/~scoats/CoBISE.html). Transcript metadata includes location and video URL. Because tokens have word timing information, the corpora can serve as starting points for the collection of audio or video data targeting specific utterances.
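As a rough illustration of how a word timing and a video URL might be combined to locate a specific utterance, here is a minimal Python sketch. It assumes only that a video URL and a token start time in seconds are already in hand. The function name `timestamped_url`, the `lead_in` parameter, and the example video ID are hypothetical, and nothing here reflects the actual file layout of the corpora, which is described on the corpus web pages.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def timestamped_url(video_url: str, start_seconds: float, lead_in: float = 2.0) -> str:
    """Return video_url with a t= parameter pointing shortly before a token.

    start_seconds is the token's start time in seconds (assumed input).
    lead_in is a hypothetical padding so playback begins just before the word.
    """
    # Never seek to a negative offset.
    start = max(0, int(start_seconds - lead_in))
    parts = urlparse(video_url)
    # Preserve existing query parameters (such as v=...) and set or replace t.
    query = parse_qs(parts.query)
    query["t"] = [f"{start}s"]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


# Hypothetical example: a token starting 75.4 seconds into a video.
print(timestamped_url("https://www.youtube.com/watch?v=abc123", 75.4))
```

The resulting link opens the video a couple of seconds before the target word, which can be handy for manually checking a transcript against the audio before collecting any data.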
The corpora are available free of charge for academic/research purposes. Download links are on the web pages.
With kind regards,
Steven Coats
University of Oulu, Finland