Dear Professor Wan,
Thank you for the pointer to your work. I'll read through the site and your posts. I appreciate your quick response.
I think our concerns may be partly orthogonal. The paper's headline measure is phone-level (schwa proportion in CMUdict transcriptions), not word-level; "word" enters only as an operational unit for the Flesch-Kincaid baseline we compare against, and the scope is bounded to English prose register classification on four named corpora. We don't claim cross-linguistic generality or make prescriptive claims about "language."
That said, your point about the undefined status of "word" across writing systems is well-taken, and I'll add a scope/limitations note making the English-and-CMUdict dependency explicit rather than implicit. It does seem important to identify and accommodate implicit anglophone bias in scientific contexts.
Best,
Kyle
On Thu, Apr 16, 2026 at 3:22 PM Ada Wan <adawan919@gmail.com> wrote:
Dear Kyle,
Please be notified of my findings from 2019 on (see sites.google.com/view/adawan) as well as all my posts and replies/comments on X.com since 2021 (@adawan919) and on LinkedIn.com. Please note that working on/with "word(s)" can be considered a violation of research integrity and/or of the law.
Feedback: While I understand that most of my findings might seem distant to those from "linguistics proper", the most important takeaway is the same: "word" is not a reliable unit for scientific work. And working on "grammar" and "language" is or can be unethical. If you can transition to research without working on or leveraging "w/s/ls/g/l", that'd be optimal. Otherwise, my immediate feedback on your work would be to just do it without "word(s)". At first sight, evaluating the schwa density of a particular dataset is not wrong in itself; in fact, from the perspective of linguistics or "'language' science" (except that my findings show that there cannot be any more science with "language"), it could even be avant-garde work to estimate, without "words", the schwa density of a given document and/or compare it with another document. But considering that "language" (or "w/s/ls/g/l") is un- and under-defined, and often defaults subliminally to a prescriptive grammarian perspective, it'd be better and safer to simply refrain from publishing this kind of philological contribution (and yes, I understand that within linguistics, this work would/could be considered scientific/rigorous already, but that is not enough).
Feel free to let me know if you should have any questions.
Best regards,
Ada Wan
https://sites.google.com/view/adawan
On Thu, Apr 16, 2026 at 8:17 PM Kyle Townsend via Corpora <corpora@list.elra.info> wrote:
Dear colleagues,
I'd like to share a new preprint on single-feature register classification in English text:
"Schwa Density as a Phonological Stylistic Classifier: Primary Stylistic, Secondary Modality -- A Four-Corpus Pre-Registered Replication"
Preprint: https://ling.auf.net/lingbuzz/009926/current.pdf?_s=WPGovroKhmABLC0P
Materials/code: https://github.com/kylegtownsend-collab/schwa-density-spgc
Paper site: https://papers.letsharkness.com/schwa-density/
The paper tests whether schwa density -- the proportion of vowel phones in a text that are unstressed schwa (CMUdict AH0) -- can serve as a phonologically motivated single-feature register classifier. A pre-registered confirmatory plan was applied to NLTK multi-source (N=164) and the Standardized Project Gutenberg Corpus (N=2,767), with sensitivity analyses on Brown (N=313) and OANC (N=4,375).
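A minimal sketch of the measure as defined above, for readers who want to see the arithmetic. It is not the repository's analyser: `TOY_DICT` is a small hand-written stand-in for CMUdict, `schwa_density` is a hypothetical name, and out-of-vocabulary tokens are simply skipped here rather than sent to a G2P fallback.

```python
import re

# Hypothetical stand-in for CMUdict: ARPAbet phones, with a stress digit
# (0, 1, or 2) on every vowel phone and none on consonants.
TOY_DICT = {
    "the": ["DH", "AH0"],
    "banana": ["B", "AH0", "N", "AE1", "N", "AH0"],
    "cat": ["K", "AE1", "T"],
    "sat": ["S", "AE1", "T"],
    "about": ["AH0", "B", "AW1", "T"],
}


def schwa_density(text, pron=TOY_DICT):
    """Return the share of vowel phones that are unstressed schwa (AH0)."""
    vowels = 0
    schwas = 0
    for token in re.findall(r"[a-z']+", text.lower()):
        # Assumption: tokens missing from the dictionary contribute nothing.
        for phone in pron.get(token, []):
            if phone[-1].isdigit():  # a trailing stress digit marks a vowel
                vowels += 1
                if phone == "AH0":
                    schwas += 1
    return schwas / vowels if vowels else 0.0


print(schwa_density("The banana sat"))  # 3 of 5 vowel phones are AH0 -> 0.6
```

The denominator counts vowel phones only, so consonant-heavy tokens do not dilute the proportion.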
Headline findings:
Schwa density matches or exceeds Flesch-Kincaid on all pre-registered corpora.
A function-word ablation (masking the 198 NLTK English stopwords before computing schwa density) preserves or amplifies register discrimination on all four corpora (eta^2 retention 0.93-1.27), ruling out stopword frequency as a confound.
The ablation operationalises a two-regime finding: schwa density functions as a Primary Stylistic Feature on within-prose variation (NLTK, SPGC, Brown) and a Secondary Modality Feature on speech-versus-writing variation (OANC).
Joint partial-eta^2 retains 46-53% of the register signal on the pre-registered corpora after controlling jointly for syllables per word, mean word length, and Latinate ratio.
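The ablation logic can be sketched as follows, under stated assumptions: "retention" is taken to be the ratio of one-way eta^2 after masking to eta^2 before masking (the post does not spell out the formula), `STOPWORDS` is a three-item placeholder for the 198-item NLTK list, and all function names are hypothetical. The density values in the usage lines are made-up illustrations, not results from the paper.

```python
# Placeholder for the NLTK English stopword list (assumption: tiny subset).
STOPWORDS = {"the", "of", "and"}


def mask_stopwords(tokens, stop=STOPWORDS):
    """Drop function words before the density is computed."""
    return [t for t in tokens if t.lower() not in stop]


def eta_squared(groups):
    """One-way eta^2 = SS_between / SS_total for {register: [densities]}."""
    values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total if ss_total else 0.0


def eta_retention(full, masked):
    """Ratio of masked to unmasked eta^2; above 1 means the signal grew."""
    return eta_squared(masked) / eta_squared(full)


# Made-up per-document densities for two registers, before and after masking.
full = {"fiction": [0.20, 0.24], "news": [0.28, 0.32]}
masked = {"fiction": [0.15, 0.17], "news": [0.27, 0.29]}
print(eta_retention(full, masked))
```

A ratio near or above 1 would indicate that register discrimination survives the removal of stopwords.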
The pre-registration, deviation log, analyser, ablation and G2P-fallback scripts, per-corpus feature tables, and figure-generation code are all openly available in the repository (MIT / CC-BY-4.0).
Comments and criticisms welcome.
Thanks,
Kyle Townsend
Independent
ktownsend@spfk12.org
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-leave@list.elra.info