Dear colleagues,
I'd like to share a new preprint on single-feature register classification in English text:
"Schwa Density as a Phonological Stylistic Classifier: Primary Stylistic, Secondary Modality -- A Four-Corpus Pre-Registered Replication"
Preprint: https://ling.auf.net/lingbuzz/009926/current.pdf?_s=WPGovroKhmABLC0P
Materials/code: https://github.com/kylegtownsend-collab/schwa-density-spgc
Paper site: https://papers.letsharkness.com/schwa-density/
The paper tests whether schwa density -- the proportion of vowel phones in a text that are unstressed schwa (CMUdict AH0) -- can serve as a phonologically motivated single-feature register classifier. A pre-registered confirmatory plan was applied to two corpora, an NLTK multi-source corpus (N=164) and the Standardized Project Gutenberg Corpus (SPGC, N=2,767), with sensitivity analyses on Brown (N=313) and OANC (N=4,375).
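For readers who want a concrete picture of the measure, here is a minimal sketch of the definition above. It is not the repository's analyser. The toy lexicon, the function names, and the choice to skip out-of-vocabulary words are my assumptions for illustration (the paper itself uses a G2P fallback for such words). It relies only on the CMUdict convention that vowel phones carry a trailing stress digit (0, 1, or 2).

```python
import math

# Hypothetical toy lexicon in CMUdict ARPAbet notation. A real run would load
# the full CMUdict; these four entries are for illustration only.
TOY_LEXICON = {
    "the": ["DH", "AH0"],
    "banana": ["B", "AH0", "N", "AE1", "N", "AH0"],
    "is": ["IH1", "Z"],
    "ripe": ["R", "AY1", "P"],
}


def is_vowel(phone: str) -> bool:
    """In CMUdict, vowel phones end in a stress digit; consonants do not."""
    return phone[-1].isdigit()


def schwa_density(tokens, lexicon) -> float:
    """Share of vowel phones that are unstressed schwa (AH0).

    Assumption: words missing from the lexicon are skipped rather than
    passed to a grapheme-to-phoneme fallback.
    """
    vowels = [
        phone
        for token in tokens
        for phone in lexicon.get(token.lower(), [])
        if is_vowel(phone)
    ]
    if not vowels:
        return math.nan  # no vowel phones found, density is undefined
    return sum(phone == "AH0" for phone in vowels) / len(vowels)


density = schwa_density("The banana is ripe".split(), TOY_LEXICON)
print(density)
```

In this toy sentence three of the six vowel phones are AH0, so the density is 0.5.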
Headline findings:
- Schwa density matches or exceeds Flesch-Kincaid as a register discriminator on all pre-registered corpora.
- A function-word ablation (masking the 198 NLTK English stopwords before computing schwa density) preserves or amplifies register discrimination on all four corpora (eta^2 retention 0.93-1.27), ruling out stopword frequency as a confound.
- The ablation operationalises a two-regime finding: schwa density functions as a Primary Stylistic Feature on within-prose variation (NLTK, SPGC, Brown) and a Secondary Modality Feature on speech-versus-writing variation (OANC).
- Joint partial-eta^2 retains 46-53% of the register signal on the pre-registered corpora after controlling jointly for syllables per word, mean word length, and Latinate ratio.
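The ablation logic in the findings above can be sketched as follows. This is a hedged illustration, not the repository's script: it assumes eta^2 is the one-way ratio SS_between / SS_total over register groups, and that "retention" means the ablated eta^2 divided by the baseline eta^2. The function names are hypothetical.

```python
def mask_stopwords(tokens, stopwords):
    """Drop function words before the feature is computed (the ablation step)."""
    return [t for t in tokens if t.lower() not in stopwords]


def eta_squared(groups):
    """One-way eta^2 = SS_between / SS_total.

    `groups` maps a register label to a list of per-text feature values.
    """
    values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total


def eta_squared_retention(baseline_groups, ablated_groups):
    """Assumed definition: ablated eta^2 as a fraction of baseline eta^2."""
    return eta_squared(ablated_groups) / eta_squared(baseline_groups)
```

Under this reading, a retention near or above 1.0 means the register signal survives, or grows, once stopwords are masked, which is what rules out stopword frequency as the driver.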
The pre-registration, deviation log, analyser, ablation and G2P-fallback scripts, per-corpus feature tables, and figure-generation code are all openly available in the repository (MIT / CC-BY-4.0).
Comments and criticisms welcome.
Thanks,
Kyle Townsend
Independent
ktownsend@spfk12.org