Dear Luca,
We have a project that deals with norm data, which we try to standardize. It is not really what you are asking for here, but you might want to check it anyway:
Essentially, we assemble and try to standardize norm data across many languages and various concepts. The concepts are defined according to another project, the Concepticon (https://concepticon.clld.org), which provides a reference catalogue for elicitation glosses in comparative linguistics.
Best,
Mattis
On 03.04.24 17:04, Luca Onnis via Corpora wrote:
I plan to construct a comprehensive word frequency list in many languages, with one row per word and four columns: the orthographic word, its phonemic transcription in IPA, the word frequency, and the language.
The words do not have to be translation equivalents, but could be the top N thousand words for a given language.
The purpose is to then carry out cross-linguistic analyses of distributional properties of phonemes and phonotactic sequences. Another use could be to train grapheme-to-phoneme models for missing words in the list for each language. Ideally, the final resource would be free to use for research purposes.
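As a rough illustration of the four-column layout and the kind of distributional analysis described above, here is a minimal Python sketch. The column names, the sample rows, the frequency values, and the convention of space-separated IPA segments are all assumptions made for the example, not part of any existing resource.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample in the four-column layout (word, IPA, frequency, language).
# The frequencies are invented, and IPA segments are space-separated by assumption.
SAMPLE_TSV = """word\tipa\tfrequency\tlanguage
the\tð ə\t1000\teng
cat\tk æ t\t50\teng
casa\tk a s a\t80\tita
"""


def phoneme_counts(rows):
    """Return frequency-weighted phoneme and phoneme-bigram counts per language."""
    unigrams = defaultdict(Counter)
    bigrams = defaultdict(Counter)
    for row in rows:
        segments = row["ipa"].split()
        freq = int(row["frequency"])
        lang = row["language"]
        # Each phoneme is weighted by the frequency of the word it occurs in.
        for seg in segments:
            unigrams[lang][seg] += freq
        # Adjacent pairs approximate phonotactic sequences.
        for pair in zip(segments, segments[1:]):
            bigrams[lang][pair] += freq
    return unigrams, bigrams


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
unigrams, bigrams = phoneme_counts(rows)
```

Weighting by word frequency gives token-based counts; incrementing by 1 instead would give type-based counts over the word list.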
Technically, the word list is easy to compile, but one has to rely on limited open-source data. While frequency information can easily be obtained for many languages from open sources, phonemic transcriptions are typically available only in proprietary dictionaries. Wiktionary is a good starting point, but its coverage is limited for some languages, so it can provide valuable data only for a handful of commonly spoken languages. I would like to obtain a wider and more representative sample of the world's languages.
I am thus inviting anyone with tips or resources to contribute to this compilation effort. Contributors would receive due acknowledgment and could specify the conditions under which their data may be used. You can reply here or write to lucao@uio.no.
Best,
Luca Onnis, PhD
Professor, University of Oslo, Norway