[Corpora-List] Re: Complex Word Identification in French

21 Jun 2022


Hi Ada,
In the English language, "words" are a thing. Children are taught to place
spaces between "words". You're not going to undo a millennium-worth of
English writing by discouraging the use of words.
Much of Latin was written without blank spaces to denote word boundaries.
In Chinese writing, there are no blank spaces to denote word-boundaries.
There's assorted NLP software that attempts to guess where those blanks may
be, so that Chinese could be segmented and passed into other NLP pipeline
stages.
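
To make the segmentation step concrete, here is a minimal sketch of one classic dictionary-based approach, greedy forward maximum matching. The function name `max_match`, the `max_len` parameter and the toy vocabulary are illustrative assumptions, not taken from any particular tool; production segmenters generally rely on statistical or neural models instead.

```python
def max_match(text, vocab, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary entry that matches;
    if nothing matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"我们", "喜欢", "语言", "语言学"}
print(max_match("我们喜欢语言学", toy_vocab))
# ['我们', '喜欢', '语言学']
```

The greedy choice is what makes this a "guess": it prefers "语言学" over "语言" simply because it is longer, which is often right but not always.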
When we speak, we don't put "blanks" between words, although
there are sometimes pauses. Realistic text-to-speech software NEVER
vocalizes words individually, and instead ALWAYS vocalizes the transition
between words, and places the break within a single phoneme (I hope it's
clear what I am saying here). Thus, from the point of view of text-to-speech
software, words don't exist, because vocalizing across word boundaries is a
fundamental requirement for normal-sounding speech. (For English.)
Now that we live in the world of statistics and deep learning and whatnot,
it's become clear that an audio stream of human speech has some parts that
are "highly conserved" (require certain sounds to follow) and other regions
which are flexible (just about any other sound can follow). And plenty of
stuff in the middle between these two extremes.   Surprisingly (or not
surprisingly, depending on who you are) the highly variable regions are not
word boundaries. Except when ... there are ... well, exceptions.
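
One rough way to picture "conserved" versus "flexible" regions is successor entropy: for each short context, how unpredictable is the symbol that follows it? The sketch below does this over a plain symbol sequence (characters here, standing in for phones or audio frames, which is a simplifying assumption). The function name `successor_entropy` and the `context_len` parameter are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def successor_entropy(sequence, context_len=2):
    """Map each context of length `context_len` to the entropy (in bits)
    of the symbol that follows it in `sequence`.

    Low entropy  = "conserved": the next symbol is nearly forced.
    High entropy = "flexible": many different symbols can follow.
    """
    followers = defaultdict(Counter)
    for i in range(len(sequence) - context_len):
        context = sequence[i:i + context_len]
        followers[context][sequence[i + context_len]] += 1

    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


scores = successor_entropy("thecatsatonthematthecatate", context_len=2)
# Print contexts from most flexible to most conserved.
for context, bits in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(context, round(bits, 3))
```

Whether the high-entropy positions line up with word boundaries is exactly the empirical question; on a toy string like this one nothing can be concluded either way.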
However, right now, I am not communicating verbally, and so I am faced with
the task of converting thoughts into sequences of (discrete) symbols. As I
learned in first grade, I do this by placing typed spaces between words.
Sure, the concept of "word" may be quite inappropriate for some obscure
languages. This is entirely plausible, as any "synthetic" language defies
the concept of "word". (Finnish and Lithuanian consist of "words", many of
which are like "antidisestablishmentarianism", and it's a children's
playground game to create the longest possible such expression. Creating
new words in these languages is like creating new sentences in English.
It's just something you do, and there are no "word boundaries" involved.)
Great. So now what?  I assume everything I wrote is 100% mainstream, known
to any and every linguist, half of whom could amplify and correct all the
mistakes I've made in the above.  Sure, but so what? You can't get rid of
the concept of "word". It's a thing.  What, exactly, is being proposed here?
-- linas
On Mon, Jun 20, 2022 at 10:33 AM Ada Wan adawan919@gmail.com wrote:
...
Hi Christopher,
It is in the best interest of the community to discontinue the usage of
"word". The term is not only very shaky in its foundation (if any), but it
can also effect disparity in performance in computational processing and
in robustness when human evaluation is involved.
Although the term has been casually adopted by many in the past, like many
un-PC terms that may have an inappropriate undertone, it needs to be
discouraged and abandoned.
Last but not least, I noticed that you are located in Canada. In the event
that you were to work with any indigenous communities, one MUST be advised
to be careful with the usage of such a term: you could be imposing your
own (EN- / FR- / dominant language-centric) view onto another
individual/community. There is an element of cultural and
linguistic hegemony in the usage of such a term (including but not limited
to making applications with it).
Please also consult recent work in this area:
https://openreview.net/forum?id=-llS6TiOew.
Feel free to get in touch if you should have any questions.
Best,
Ada
On Mon, Jun 20, 2022 at 4:53 PM Christopher Collins <
Christopher.Collins@ontariotechu.ca> wrote:
...
Hello,
I’m looking for any open source or cloud-hosted solution for complex word
identification or word difficulty rating in French for a reading
application.
As a backup plan we can use measures like corpus frequency, length,
number of senses, but we’re hoping someone has already made a tool
available.
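
A minimal sketch of what such a backup measure could look like, combining only corpus frequency and length (number of senses is left out because it needs a sense inventory for French). The names `build_counts` and `difficulty`, the `length_weight` value and the five-token toy corpus are all assumptions for illustration; any real use would need a large French corpus and weights tuned against reader judgments.

```python
import math
from collections import Counter


def build_counts(tokens):
    """Count lower-cased token frequencies in a reference corpus."""
    return Counter(t.lower() for t in tokens)


def difficulty(word, counts, length_weight=0.5):
    """Heuristic difficulty: rarity in bits plus a length penalty.

    Rarity is -log2 of the add-one smoothed relative frequency, so
    words never seen in the corpus still get a finite score.
    """
    total = sum(counts.values())
    vocab_size = len(counts)
    p = (counts[word.lower()] + 1) / (total + vocab_size + 1)
    rarity = -math.log2(p)
    return rarity + length_weight * len(word)


# Toy corpus (an assumption for illustration only).
counts = build_counts("le chat mange le poisson".split())
for w in ["le", "poisson", "anticonstitutionnellement"]:
    print(w, round(difficulty(w, counts), 2))
```

Higher scores mean "probably harder"; the score has no absolute meaning and is only useful for ranking words against each other.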
We found this but that’s it: https://github.com/sheffieldnlp/cwi
Would appreciate any tips!
Thanks,
Chris
Christopher Collins [he/him, https://medium.com/gender-inclusivit/why-i-put-pronouns-on-my-email-signature-and-linkedin-profile-and-you-should-too-d3dc942c8743]
Associate Professor - Faculty of Science
Canada Research Chair in Linguistic Information Visualization
Ontario Tech University
vialab.ca
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list -- corpora@list.elra.info
To unsubscribe send an email to corpora-leave@list.elra.info

-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
