Re: [EXTERNAL] Re: New paper: Schwa density as a phonological register classifier (pre-registered, four-corpus replication) - Corpora

16 Apr 2026

Dear Professor Wan,
Thank you for the quick response and for the pointer to your work. I'll
read through the site and your posts, and I look forward to reading more.
I think our concerns may be partly orthogonal. The paper's headline
measure is phone-level (schwa proportion in CMUdict transcriptions), not
word-level; "word" enters only as an operational unit for the
Flesch-Kincaid baseline we compare against, and the scope is bounded to
English prose register classification on four named corpora. We don't claim
cross-linguistic generality or make prescriptive claims about "language."
That said, your point about the undefined status of "word" across writing
systems is well-taken, and I'll add a scope/limitations note making the
English-and-CMUdict dependency explicit rather than implicit. It does seem
important to identify and accommodate implicit anglophone bias in
scientific contexts.
Best,
Kyle
On Thu, Apr 16, 2026 at 3:22 PM Ada Wan adawan919@gmail.com wrote:
...
Dear Kyle
Please be notified of my findings from 2019 on (see
sites.google.com/view/adawan) as well as all my posts and
replies/comments on X.com since 2021 (@adawan919) and on LinkedIn.com.
Please note that working on/with "word(s)" can be considered a violation
of research integrity and/or of the law.
Feedback:
While I understand that most of my findings might seem distant to those
from "linguistics proper", the most important takeaway is the same: "word"
is not a reliable unit for scientific work. And working on "grammar" and
"language" is / can be unethical. If you can transition to research without
working on or leveraging "w/s/ls/g/l", that'd be optimal. Otherwise, my
immediate feedback on your work would be: just do it without "word(s)".
At first sight, evaluating the schwa density of a particular dataset is
not wrong in itself; in fact, from the perspective of linguistics or
"'language' science" (except my findings show that there cannot be any more
science with "language"), it could even be avant-garde work to estimate,
without "words", the schwa density of a given document and/or compare it
with another document. But considering that "language" (or "w/s/ls/g/l")
is un- and under-defined, and often defaults subliminally to a
prescriptive grammarian perspective, it'd be better and safer to simply
refrain from publishing this kind of philological contribution (and yes, I
understand that within linguistics, this work would/could be considered
scientific/rigorous already, but that is not enough).
Feel free to let me know if you should have any questions.
Best regards
Ada Wan
https://sites.google.com/view/adawan
On Thu, Apr 16, 2026 at 8:17 PM Kyle Townsend via Corpora <
corpora@list.elra.info> wrote:
...
Dear colleagues,
I'd like to share a new preprint on single-feature register
classification in English text:
"Schwa Density as a Phonological Stylistic Classifier: Primary
Stylistic, Secondary Modality -- A Four-Corpus Pre-Registered
Replication"
Preprint: https://ling.auf.net/lingbuzz/009926/current.pdf?_s=WPGovroKhmABLC0P
Materials/code: https://github.com/kylegtownsend-collab/schwa-density-spgc
Paper site: https://papers.letsharkness.com/schwa-density/
The paper tests whether schwa density -- the proportion of vowel
phones in a text that are unstressed schwa (CMUdict AH0) -- can
serve as a phonologically motivated single-feature register
classifier. A pre-registered confirmatory plan was applied to NLTK
multi-source (N=164) and the Standardized Project Gutenberg Corpus
(N=2,767), with sensitivity analyses on Brown (N=313) and OANC
(N=4,375).
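For readers who want the measure in concrete terms, the definition above can be sketched as follows. This is a minimal illustration, not the repository's analyser: TOY_PRON is a hypothetical four-entry stand-in for CMUdict, the tokenizer is a bare regex, and out-of-vocabulary tokens are simply skipped rather than sent to a G2P fallback.

```python
import re

# Hypothetical toy lexicon standing in for CMUdict (ARPAbet phones).
# In CMUdict, vowel phones end in a stress digit (0, 1, or 2).
TOY_PRON = {
    "the": ["DH", "AH0"],
    "banana": ["B", "AH0", "N", "AE1", "N", "AH0"],
    "is": ["IH1", "Z"],
    "ripe": ["R", "AY1", "P"],
}


def schwa_density(text, pron=TOY_PRON):
    """Proportion of vowel phones that are unstressed schwa (AH0)."""
    vowels = 0
    schwas = 0
    for token in re.findall(r"[a-z']+", text.lower()):
        phones = pron.get(token)
        if phones is None:
            continue  # assumption: skip unknown tokens in this sketch
        for phone in phones:
            if phone[-1].isdigit():  # stress digit marks a vowel
                vowels += 1
                if phone == "AH0":
                    schwas += 1
    return schwas / vowels if vowels else 0.0


print(schwa_density("The banana is ripe"))  # 3 schwas / 6 vowels = 0.5
```

Swapping TOY_PRON for a full pronunciation dictionary would leave the counting logic unchanged; how multiple pronunciations per entry are resolved is a separate choice this sketch does not address.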
Headline findings:

Schwa density matches or exceeds Flesch-Kincaid on all
pre-registered corpora.

A function-word ablation (masking the 198 NLTK English stopwords
before computing schwa density) preserves or amplifies register
discrimination on all four corpora (eta^2 retention 0.93-1.27),
ruling out stopword frequency as a confound.

The ablation operationalises a two-regime finding: schwa density
functions as a Primary Stylistic Feature on within-prose
variation (NLTK, SPGC, Brown) and a Secondary Modality Feature on
speech-versus-writing variation (OANC).

Joint partial-eta^2 retains 46-53% of the register signal on the
pre-registered corpora after controlling jointly for syllables
per word, mean word length, and Latinate ratio.
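The eta^2 retention figure in the ablation finding above can be read as a ratio of two effect sizes. The sketch below assumes a one-way design, with eta^2 computed as the between-register sum of squares over the total sum of squares, and retention as ablated eta^2 divided by full-text eta^2. Both function names and that reading of "retention" are assumptions for illustration, not the repository's code.

```python
def eta_squared(groups):
    """One-way eta^2: between-group sum of squares / total sum of squares.

    `groups` is a list of non-empty lists, one per register, each holding
    per-document schwa densities.
    """
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total if ss_total else 0.0


def eta_retention(full_groups, ablated_groups):
    """Assumed definition: eta^2 after ablation divided by eta^2 before."""
    return eta_squared(ablated_groups) / eta_squared(full_groups)
```

Under this reading, a retention above 1 would mean the register groups separate more cleanly once stopwords are masked, and a value just below 1 would mean most of the separation survives.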

The pre-registration, deviation log, analyser, ablation and
G2P-fallback scripts, per-corpus feature tables, and
figure-generation code are all openly available in the repository
(MIT / CC-BY-4.0).
Comments and criticisms welcome.
Thanks,
Kyle Townsend
Independent
ktownsend@spfk12.org

-- 
Thanks,
Kyle Townsend
Instructor, English IIA, Humanities, Yearbook I/II
Scotch Plains-Fanwood High School
Pronouns: he/him/his (What's This? https://www.mypronouns.org/)