[Corpora-List] Re: Complex Word Identification in French

20 Jun 2022


Yeah... I really don't know what to do with "the resistance" or "the
ignorance" (as in both the practice of intentionally ignoring my results
and otherwise), etc. Many of us are so used to both naming and
processing at such granularity that it would take the whole world for change
to happen, and I hope we can work together instead of against each other.
If we sustain the w hackery, we are hurting ourselves and our students in the
end. From the tech point of view, things are getting ever finer and
our hardware can take on more. I don't know what people have in mind if/when
they want our next generation to be quibbling about things like "is ... a
w*rd?" or whether the w boundary should be placed exactly here or there.
I suggested in "Fairness in Representation"
(https://openreview.net/pdf?id=-llS6TiOew) that we should use clearer
nomenclature:
"one can simply describe languages and their statistical profiles with
respect to their representational granularity in characters or bytes (which
are and/or can be exhaustively standardized in computing), or refer to
sequences as longer/shorter or having a higher/lower vocabulary size when
comparing them with each other, rather than “richer”/“poorer” based on
concepts (e.g. “words”, “sentences”) that can be ambiguous, contested, and
inaccessible to many."
And in the event that our downstream task results are already so good,
there will always be room for content creation. That's also what I tried to
advocate in the paper: quality multiway datasets for science, eval, and
documentation. I wish the world would place more value on data. There
is a lot of room to translate our engineering back into science, to
re-educate students in language science/technology to become more
stats-savvy, and to re-educate the world about fairness in multilinguality...
On Mon, Jun 20, 2022 at 10:13 PM Daniel HENKEL daniel.henkel@univ-paris8.fr
wrote:
...
Not to mention all these shamefully unscientific posts on Corporalist:
*12th International Global W*rdnet Conference Donostia / San Sebastian,
Basque Country 23-27, 2023 Global W*rdnet Association:
www.globalw*rdnet.org*
*Conference website: https://hitz.eus/gwc2023*
*18th Workshop on Multiw*rd Expressions (MWE 2022) Organized and sponsored
by SIGLEX, the Special Interest Group on the Lexicon of the ACL*
*The 5th Workshop on Multi-w*rd Units in Machine Translation and
Translation Technology (MUMTTT 2022) Malaga, 30th September 2022*
...
Definitely time for some lexical/terminological restrictions/updates, for
the sake of goodthink/processing, and science!
(actually "science" is heretical/redundant, "goodthink/processing" will do
the job:
*"As we have already seen in the case of the word FREE, w*rds which had
once borne a heretical meaning were sometimes retained for the sake of
convenience, but only with the undesirable meanings purged out of them.
Countless other w*rds such as HONOUR, JUSTICE, MORALITY, INTERNATIONALISM,
DEMOCRACY, SCIENCE, and RELIGION had simply ceased to exist."*)
DH
On 20/06/2022 21:47, Daniel HENKEL wrote:
Looks as if Linguistlist is in need of some scientific enlightenment as
well:
http://linguistlist.org/issues/33/33-2063.html
*In the new, thoroughly revised second edition of W*rds of Wonder:
Endangered Languages and What They Tell Us, Second Edition (formerly called
Dying W*rds: Endangered Languages and What They Have to Tell Us), renowned
scholar Nicholas Evans delivers an accessible and incisive text covering
the impact of mass language endangerment. The distinguished author explores
issues surrounding the preservation of indigenous languages, ...*
(ungood w*rds unw*rded to protect the faint of mind against ungood
thinking/processing).
Best,
DH
On 20/06/2022 20:27, Ada Wan wrote:
(I just expounded on a point as a twitter reply today re the granularity
of one's thinking/processing. Pls feel free to read that also.)
One can think of it in a less binary manner: not "good" vs "bad", not
"words" then "sentences", but an utterance/sequence with all
the finer connections in between... That is the beauty of language, from
a "philological" point of view.
I am not sure, though, if you were speaking from a scientific perspective,
because I have a paper to back my argument in that regard.
On Mon, Jun 20, 2022 at 6:06 PM Sylvain Kahane sylvain@kahane.fr wrote:
...
“We’re destroying words–scores of them, hundreds of them, every day.
We’re cutting the language down to the bone.” […]
“It’s a beautiful thing, the destruction of words. Of course the great
advantage is in the verbs and adjectives, but there are hundreds of nouns
that can be got rid of as well. It isn’t only the synonyms; there are also
the antonyms. After all, what justification is there for a word which is
simply the opposite of some other words? A word contains its opposite in
itself. Take ‘good,’ for instance. If you have a word like ‘good,’ what
need is there for a word like ‘bad’? ‘Ungood’ will do just as well–better,
because it’s an exact opposite, which the other is not. Or again, if you
want a stronger version of ‘good,’ what sense is there in having a whole
string of vague useless words like ‘excellent’ and ‘splendid’ and all the
rest of them? ‘Plusgood’ covers the meaning, or ‘doubleplusgood’ if you
want something stronger still. Of course we use those forms already, but in
the final version of Newspeak there’ll be nothing else. In the end the
whole notion of goodness and badness will be covered by only six words–in
reality, only one word. Don’t you see the beauty of that, Ada?…”
George Orwell, 1984
...
Le 20 juin 2022 à 17:33, Ada Wan adawan919@gmail.com a écrit :
Hi Christopher,
It is in the best interest of the community to discontinue the usage of
"word". The term is not only very shaky in its foundation (if any), but it
can also effect disparity in performance in computational processing and
robustness when human evaluation is involved.
...
Although the term has been casually adopted by many in the past, like
many un-PC terms that may have an inappropriate undertone, it needs to be
discouraged and abandoned.
...
Last but not least, I noticed that you are located in Canada. In the
event that you work with any indigenous communities, you MUST be
careful with the usage of such a term: you could be imposing
your own (EN- / FR- / dominant language-centric) view onto another
individual/community. There is an element of cultural and linguistic
hegemony in the usage of such a term (including but not limited to building
applications with it).
...
Please also consult recent work in this area:
https://openreview.net/forum?id=-llS6TiOew.
...
Feel free to get in touch if you should have any questions.
Best,
Ada
On Mon, Jun 20, 2022 at 4:53 PM Christopher Collins <
Christopher.Collins@ontariotechu.ca> wrote:
...
Hello,
I’m looking for any open source or cloud-hosted solution for complex
word identification or word difficulty rating in French for a reading
application.
...
As a backup plan we can use measures like corpus frequency, length,
number of senses, but we’re hoping someone has already made a tool
available.
...
We found this but that’s it: https://github.com/sheffieldnlp/cwi
Would appreciate any tips!
Thanks,
Chris
Christopher Collins [he/him]
Associate Professor - Faculty of Science
Canada Research Chair in Linguistic Information Visualization
Ontario Tech University
vialab.ca
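
The backup plan mentioned in the original request (corpus frequency, length, number of senses) could be sketched roughly as below. This is only an illustrative sketch, not a tool from the thread: the function names, the toy French corpus, the `senses` dictionary, and the weights are all hypothetical placeholders, and a real system would need a large reference corpus, a lexical resource for sense counts, and weights fitted on annotated difficulty data.

```python
import math
from collections import Counter


def build_frequencies(tokens):
    """Relative frequency of each lowercased token in a reference corpus."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def difficulty_features(token, freqs, senses, floor=1e-6):
    """The three measures from the message: frequency, length, number of senses.

    `floor` is an assumed frequency for tokens missing from the corpus.
    """
    t = token.lower()
    return {
        "neg_log_freq": -math.log(freqs.get(t, floor)),  # rarer -> larger
        "length": len(t),                                # in characters
        "n_senses": senses.get(t, 0),                    # 0 if unknown
    }


def difficulty_score(token, freqs, senses, weights=(1.0, 0.2, -0.1)):
    """Weighted sum of the features; the weights are arbitrary placeholders."""
    f = difficulty_features(token, freqs, senses)
    w_freq, w_len, w_senses = weights
    return (w_freq * f["neg_log_freq"]
            + w_len * f["length"]
            + w_senses * f["n_senses"])


# Toy reference corpus and hypothetical sense counts, for illustration only.
corpus = "le chat mange le poisson et le chien mange la viande".split()
freqs = build_frequencies(corpus)
senses = {"le": 1, "chat": 2}

for tok in ["le", "poisson", "anticonstitutionnellement"]:
    print(tok, round(difficulty_score(tok, freqs, senses), 2))
```

With these placeholder weights, rare and long items score higher than frequent short ones; the ranking, rather than the absolute values, is what a reading application would likely use.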

UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list -- corpora@list.elra.info
To unsubscribe send an email to corpora-leave@list.elra.info
--
Daniel HENKEL https://univ-paris8.academia.edu/DanielHENKEL
*Maître de Conférences (Linguistique et Traduction) UFR5 LLCE-LEA • EA1569
TransCrit*
Université Paris 8 Vincennes-St-Denis
*“non si può stendere una tipologia delle traduzioni, ma al massimo una
tipologia di diversi modi di tradurre, volta per volta negoziando il fine
che ci si propone – e volta per volta scoprendo che i modi di tradurre sono
più di quelli che sospettiamo.”* U. Eco
