I plan to construct a comprehensive word frequency list covering many
languages. Each entry would have four columns: the orthographic word, its
phonemic transcription in IPA, its frequency, and the language it belongs
to.
The words need not be translation equivalents across languages; they could
simply be the top N thousand most frequent words in each language.
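To make the intended layout concrete, here is a minimal sketch in Python of
the four-column table and a "top N per language" selection. The field names,
the tab-separated format, the frequency unit, and the three-letter language
codes are my own assumptions for illustration, and the sample values are
made-up placeholders, not real counts.

```python
import csv
import io
from dataclasses import dataclass, astuple, fields


@dataclass(frozen=True)
class Entry:
    word: str         # orthographic form
    ipa: str          # phonemic transcription in IPA
    frequency: float  # frequency value (unit is an assumption, e.g. per million)
    language: str     # language code (three-letter codes assumed here)


def to_tsv(entries):
    """Serialize entries as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([f.name for f in fields(Entry)])
    for entry in entries:
        writer.writerow(astuple(entry))
    return buf.getvalue()


def top_n(entries, language, n):
    """Return the n most frequent entries for one language."""
    rows = [e for e in entries if e.language == language]
    return sorted(rows, key=lambda e: e.frequency, reverse=True)[:n]


# Placeholder rows: the frequencies are invented for illustration only.
SAMPLE = [
    Entry("cat", "kæt", 30.0, "eng"),
    Entry("the", "ðə", 50000.0, "eng"),
    Entry("casa", "ˈkaza", 400.0, "ita"),
]

if __name__ == "__main__":
    print(to_tsv(SAMPLE))
    print(top_n(SAMPLE, "eng", 1))
```

A flat table like this keeps languages independent of one another, so each
language can contribute however many words are available for it.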
The purpose is then to carry out cross-linguistic analyses of the
distributional properties of phonemes and phonotactic sequences. Another
use could be to train grapheme-to-phoneme models that transcribe words
missing from the list for each language. Ideally, the final resource would
be free to use for research purposes.
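As a rough illustration of the kind of distributional analysis meant here,
the sketch below tallies frequency-weighted counts of single phonemes and of
adjacent phoneme pairs. It assumes, purely for simplicity, that each
transcription has already been segmented into phonemes separated by spaces;
real IPA strings would need a proper segmentation step first. The function
name and the sample rows are hypothetical.

```python
from collections import Counter


def phoneme_counts(rows):
    """Tally frequency-weighted phoneme and phoneme-pair counts.

    rows: iterable of (ipa, frequency) pairs, where ipa is a string of
    space-separated phonemes (an assumption of this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for ipa, freq in rows:
        segments = ipa.split()
        for seg in segments:
            unigrams[seg] += freq
        # Adjacent pairs approximate phonotactic sequences of length two.
        for first, second in zip(segments, segments[1:]):
            bigrams[(first, second)] += freq
    return unigrams, bigrams


# Made-up rows for illustration: (segmented transcription, frequency).
ROWS = [("k æ t", 10), ("t æ k", 5)]

if __name__ == "__main__":
    uni, bi = phoneme_counts(ROWS)
    print(uni.most_common())
    print(bi.most_common())
```

Weighting by word frequency gives token-based distributions; setting every
frequency to 1 would give type-based ones instead.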
Technically, the word list is easy to compile, but one has to rely on
limited open-source data. While frequency information can easily be
obtained for many languages from open sources, phonemic transcriptions are
typically available only in proprietary dictionaries. Wiktionary is a good
starting point, but its coverage is limited for many languages, so it
provides valuable data only for a handful of widely spoken languages. I
would like to obtain a wider and more representative sample of the world's
languages.
I am thus inviting anyone with tips or resources to contribute to this
compilation effort. Contributors would receive due acknowledgment and
could specify the conditions under which their data may be used.
You can reply here or write to lucao(a)uio.no .
Best,
Luca Onnis, PhD
Professor, University of Oslo, Norway