[Corpora-List]Re: Complex Word Identification in French

22 Jun 2022


Hi Ada,
a very good paper (and a lot of work done - congrats!) and a very interesting
thread.
Clearly linguistics as a field is sorely lacking a unified taxonomy
(compared to biology, chemistry, whatever -- the difference is rather
striking) and yes, this is causing serious trouble in NLP and in general
(with linguists spending time promoting "their" definition instead of
promoting harmonization regardless of their own preferences and traditions).
But "word" here is not a special case -- it's just one of the many
linguistic terms that are ill-defined.
When you say:
*It is of the best interest of the community to discontinue the usage of
"word".*
you must realize this is never going to happen (and it just results in
getting an Orwellian hail; quite understandably, you'll not find a lot of
support for prescriptive views on language on this list :).
This is for both practical reasons ("word" is a word everyone is used to
using, it is normal, and being normal is the strongest card you can play in
language) as well as theoretical ones (the problem is not the string-label,
but the concept itself, so changing the label is of no help here).
So, taking this practically:
- you may try to convince linguists to agree on some harmonized taxonomy
(and I wish I knew how to be of any help here, but I don't think the
majority sees it as a problem at all - they see it as "complexity/property
of language" - so I don't believe this is going to happen anytime soon)
- you may promote the awareness of how word-processing impacts performance
and evaluation of NLP tools
The latter I think is much more useful and more likely to at least
partially succeed -- and I was really happy to see that your paper
quantifies that in state-of-the-art techniques (you may want to have a look
at the 2016 PhD thesis of one of my ex-colleagues, who investigated
character-based language modelling:
https://is.muni.cz/th/en6ay/thesis.pdf).
It is an omnipresent problem that actually starts with English and simple
tasks like PoS tagging in a slightly different way: the impact of not
following exactly the same tokenization (esp. for high frequency items like
"don't") the tagger expects/was trained on, is huge.
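To make the "don't" case concrete, here is a minimal sketch. Both tokenizers and the tiny tagger lexicon are toy assumptions made up for illustration, not the behaviour of any particular tagger; the point is only that a token the tagger never saw in training ("don't" as one unit) falls out of its vocabulary.

```python
import re

def whitespace_tokenize(text):
    # Splits on whitespace only, so "don't" stays a single token.
    return text.split()

def ptb_style_tokenize(text):
    # Toy Penn-Treebank-like rule: split the clitic "n't" off its host word.
    return re.sub(r"(\w+)(n't)\b", r"\1 \2", text).split()

# Hypothetical lexicon of a tagger trained on PTB-style tokens.
TAGGER_LEXICON = {"I": "PRP", "do": "VBP", "n't": "RB", "know": "VB"}

def unknown_tokens(tokens):
    # Tokens the (hypothetical) tagger has no entry for.
    return [t for t in tokens if t not in TAGGER_LEXICON]

sentence = "I don't know"
for tokenize in (whitespace_tokenize, ptb_style_tokenize):
    tokens = tokenize(sentence)
    print(tokenize.__name__, tokens, "unknown:", unknown_tokens(tokens))
```

With the whitespace tokenizer the high-frequency item "don't" is unknown to the lexicon; with the matching tokenization nothing is.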
In Sketch Engine, where we use tons of word-processing tools (mostly
segmenters/stemmers/PoS taggers/lemmatizers), getting the input tokenization
right is often the most difficult job; and of course, more so for languages
where the word is more of an artificial and arbitrary concept.
From a non-academic perspective, the issue is that users expect something
familiar like words, and many are aware of the level of arbitrariness of
"words" in their focus language.
So, reading:
*Well, talk to the NLP crowd or the ones who expect LM/MT results from
different languages should have different performances, even if/when all
else were equal. (I remember how hard and how many rounds I had to work for
my rebuttals....)*
Indeed, this is very much what everyone is used to :-/
From a purely technical perspective, switching to characters (or bytes --
which however are not a good level of abstraction in terms of
interpretation, especially with variable-length encodings like UTF-8)
sometimes is the right thing to do (though the figures are easily skewed --
esp. in a web corpus with plenty of long URLs).
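A small sketch of both points, with made-up example tokens and a hypothetical helper mean_length: the same string has different lengths in characters and in UTF-8 bytes, and a single long URL shifts a per-token average noticeably.

```python
def mean_length(tokens, unit="chars"):
    # Average token length, counted in Unicode characters or in UTF-8 bytes.
    if unit == "chars":
        lengths = [len(t) for t in tokens]
    else:
        lengths = [len(t.encode("utf-8")) for t in tokens]
    return sum(lengths) / len(tokens)

# "deja" with accents, written with escapes so the code points are unambiguous.
word = "d\u00e9j\u00e0"
print(len(word), "characters,", len(word.encode("utf-8")), "bytes")

# Made-up tokens: three short words, then the same plus one long URL.
plain = ["un", "chat", "dort"]
with_url = plain + ["https://example.org/a/very/long/path?x=1"]
print(mean_length(plain), mean_length(with_url))
print(mean_length([word], "chars"), mean_length([word], "bytes"))
```

Each accented letter here takes two bytes in UTF-8, which is why byte counts are harder to interpret than character counts.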
And sometimes the desired level of abstraction is the other way around,
arriving at MWEs being even more of a nightmare than poor "words",
whatever they represent.
And sometimes, the best way is to keep using "words" with lots of policing
around what they are, which btw might very well be Christopher's case with
French.
Which way to go depends on particular use cases, so: is "word" a
well-defined unit? Certainly not. Is it a useful one? Sometimes yes.
Cheers and "*I wished the world would give more worth to data*"-too
Milos
Milos Jakubicek
CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.eu
On Mon, 20 Jun 2022 at 17:34, Ada Wan adawan919@gmail.com wrote:
...
Hi Christopher,
It is of the best interest of the community to discontinue the usage of
"word". The term is not only very shaky in its foundation (if any), but it
can also effect disparity in performance in computational processing and
robustness when human evaluation is involved.
Although the term has been casually adopted by many in the past, like many
un-PC terms that may have an inappropriate undertone, it needs to be
discouraged and abandoned.
Last but not least, I noticed that you are located in Canada. In the event
that you work with any indigenous communities, you MUST be careful with the
usage of such a term --- you could be imposing your own (EN- / FR- /
dominant language-centric) view onto another individual/community. There is
an element of cultural and linguistic hegemony in the usage of such a term
(including but not limited to making applications with it).
Please also consult recent work in this area:
https://openreview.net/forum?id=-llS6TiOew.
Feel free to get in touch if you should have any questions.
Best,
Ada
On Mon, Jun 20, 2022 at 4:53 PM Christopher Collins <Christopher.Collins@ontariotechu.ca> wrote:
...
Hello,
I’m looking for any open source or cloud-hosted solution for complex word
identification or word difficulty rating in French for a reading
application.
As a backup plan we can use measures like corpus frequency, length,
number of senses, but we’re hoping someone has already made a tool
available.
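A rough sketch of what such a backup baseline could look like, combining corpus frequency and length only. The frequency values, the 0.1 length weight and the function name difficulty are all invented for illustration; real frequencies would have to come from a French corpus, and a sense count would need a separate lexical resource.

```python
import math

# Hypothetical per-million frequencies, made up for this example.
FREQ_PER_MILLION = {
    "maison": 570.0,
    "chat": 80.0,
    "anticonstitutionnellement": 0.01,
}

def difficulty(word, freq=FREQ_PER_MILLION, floor=0.001):
    # Rarer words score higher; unseen words fall back to a small floor frequency.
    f = freq.get(word.lower(), floor)
    rarity = -math.log10(f)
    # Longer words score slightly higher; the 0.1 weight is arbitrary.
    return rarity + 0.1 * len(word)

for w in sorted(FREQ_PER_MILLION, key=difficulty):
    print(f"{w}: {difficulty(w):.2f}")
```

The scores are only meaningful as a ranking within one frequency list, not as absolute difficulty levels.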
We found this but that’s it: https://github.com/sheffieldnlp/cwi
Would appreciate any tips!
Thanks,
Chris
*Christopher Collins* [he/him, https://medium.com/gender-inclusivit/why-i-put-pronouns-on-my-email-signature-and-linkedin-profile-and-you-should-too-d3dc942c8743]
Associate Professor - Faculty of Science
Canada Research Chair in Linguistic Information Visualization
Ontario Tech University
vialab.ca
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list -- corpora@list.elra.info
To unsubscribe send an email to corpora-leave@list.elra.info

