Hi Milos,
Pardon my late reply. I actually took the time and found some joy in reading Vít's dissertation (thanks for pointing me to it) --- a "kindred spirit" to my paper in its various versions (FaiR short https://openreview.net/forum?id=-llS6TiOew, R&B https://openreview.net/forum?id=dKwmCtp6YI, FaiR long https://drive.google.com/file/d/1eKbhdZkPJ0HgU1RsGXGFBPGameWIVdt9/view). I wish I had read his work earlier so I could have cited it! I found his relating of structure to segmentation (p. 7) insightful. But overall, just the willingness to move to the byte space for processing shows linguistic maturity, as opposed to those who insist on evaluating "language" based on "w*rds" and "sentences". (I wish people weren't so "hooked" on "w*rds"... it is just a different way of looking at things, no? When I first submitted my work to confs, I thought ppl would just get it as "so 'w*rds' are not so good, so let's switch", but instead I got totally beaten up (metaphorically) for it...) I think we could process in bytes, and "back-interpret" to chars.
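(For instance, a minimal sketch of what I mean by processing in bytes and back-interpreting to chars; purely illustrative Python, not from my paper:)

    # Work in byte space; recover characters only for human inspection.
    text = "naïveté"                      # any Unicode string
    b = text.encode("utf-8")              # byte-level view: 9 bytes for 7 chars here
    ids = list(b)                         # a model would consume these 0-255 integers
    # "back-interpretation": decode byte spans back to characters
    recovered = bytes(ids).decode("utf-8")
    assert recovered == text
    # a span may end inside a multi-byte character; decode leniently when inspecting
    partial = bytes(ids[:3]).decode("utf-8", errors="replace")   # 'na' + replacement char
    print(len(text), len(b), recovered, partial)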
I think I am on the same page with you on many things. E.g. re "w*rd": I absolutely agree that it is not the string label but the concept itself (de dicto vs de re --- fixing the former does not entail fixing the latter, the same way prejudices remain in humans despite all the "PC terms" etc. --- at the end of the day, it is our attitude towards everything that counts, not just what we say). As nomenclature for processing, however, I think we could use "token", where a token can be of any granularity, e.g. a "char token" as a char unigram. I didn't mean to "police" ppl's colloquial use of "w*rd" --- I used this form also to help raise awareness. There is some implicit linguistic hegemony, a colonial past, embodied in that term in the history of our field.
Re data: I wish we could have a data-centric statistical science for language! It would help update linguistics beyond the canonical structural-linguistics framework based on grammar, well-formedness etc. We are all data specialists in a way, and we are interested in data --- why not understand its statistical properties? Corpus linguistics has mostly been w*rd-based. Most practitioners at ML confs hold data constant to test algos; why can't there be an event where we hold algos constant and test data?
Thanks and best, Ada
p.s. to the community: sorry that my initial comment to this thread turned the thread into a discussion of my paper --- or so it seems. But I continue to welcome feedback on my work, as I am still working on an extended version of it (and please lmk if I should start another thread instead; the thread originator has been removed from this thread at his request!). I think, in general, the ELRA/corpora-list folks can be more experienced and sophisticated than most of the "MLNLP" folks, e.g. on the twittersphere. I would appreciate your input.
On Wed, Jun 22, 2022 at 1:42 PM Miloš Jakubíček <milos.jakubicek@sketchengine.eu> wrote:
Hi Ada,
a very good paper (and a lot of work done - congrats!) and a very interesting thread.
Clearly linguistics as a field is terribly lacking some unified taxonomy (compared to biology, chemistry, whatever -- the difference is rather striking) and yes, this is causing serious trouble in NLP and in general (with linguists spending time promoting "their" definition instead of promoting harmonization regardless of their own preferences and traditions). But "word" here is not a special case -- it's just one of the many linguistic terms that are ill-defined.
When you say:
*It is of the best interest of the community to discontinue the usage of "word".*
you must realize this is never going to happen (and just results in being hailed as Orwellian, quite understandably; you'll not find a lot of support for prescriptive views on language on this list :). This is for both practical reasons ("word" is a word everyone is used to using; it is normal, and being normal is the strongest card you can play in language) and theoretical ones (the problem is not the string label, but the concept itself, so changing the label is of no help here).
So, taking this practically:
- you may try to convince linguists to make some harmonized taxonomy (and I wish I knew how to be of any help here, but I don't think the majority sees it as a problem at all - they see it as a "complexity/property of language" - so I don't believe this is going to happen anytime soon)
- you may promote the awareness of how word-processing impacts the performance and evaluation of NLP tools
The latter, I think, is much more useful and more likely to at least partially succeed -- and I was really happy to see that your paper quantifies this for state-of-the-art techniques (you may have a look at the 2016 PhD thesis of one of my ex-colleagues, who investigated character-based language modelling: https://is.muni.cz/th/en6ay/thesis.pdf). It is an omnipresent problem that actually starts, in a slightly different way, with English and simple tasks like PoS tagging: the impact of not following exactly the same tokenization the tagger expects / was trained on (esp. for high-frequency items like "don't") is huge. In Sketch Engine, where we use tons of word-processing tools (mostly segmenters/stemmers/PoS taggers/lemmatizers), getting the input tokenization right is often the most difficult job; and of course, more so for languages where the word is more of an artificial and arbitrary concept. From a non-academic perspective, the issue is that users expect something familiar like words, and not many are aware of the level of arbitrariness of "words" in their focus language.
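(Just to make the mismatch concrete: a toy sketch, not our actual pipeline, and the tiny "lexicon" below is made up for illustration.)

    # A tagger trained on PTB-style tokenization vs. input split naively on whitespace.
    sentence = "I don't like it."

    penn_style = ["I", "do", "n't", "like", "it", "."]   # what the tagger was trained on
    naive_split = sentence.split()                       # ['I', "don't", 'like', 'it.']

    # Toy lexicon keyed by the training tokenization:
    lexicon = {"I": "PRP", "do": "VBP", "n't": "RB", "like": "VB", "it": "PRP", ".": "."}
    print([lexicon.get(t, "UNK") for t in penn_style])   # fully covered
    print([lexicon.get(t, "UNK") for t in naive_split])  # "don't" and "it." fall through as UNK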
So, reading:
*Well, talk to the NLP crowd or the ones who expect LM/MT results from different languages should have different performances, even if/when all else were equal. (I remember how hard and how many rounds I had to work for my rebuttals....)*
Indeed, this is very much what everyone is used to :-/ From a purely technical perspective, switching to characters (or bytes -- which, however, are not a good level of abstraction in terms of interpretation, especially with variable-length encodings like UTF-8) is sometimes the right thing to do (though the figures get easily skewed -- esp. in a web corpus with plenty of long URLs). And sometimes the desired level of abstraction goes the other way, arriving at MWEs, which are even more of a nightmare than poor "words", whatever they represent. And sometimes the best way is to keep using "words" with lots of policing around what they are, which btw might very well be Christopher's case with French. Which way to go depends on the particular use case, so: is "word" a well-defined unit? Certainly not. Is it a useful one? Sometimes yes.
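(A trivial illustration of both points: why byte counts don't line up with characters under UTF-8, and how web-corpus noise like long URLs skews character-level figures. Throwaway code; the examples are made up.)

    samples = {
        "french": "cœur déçu",
        "czech": "Příliš žluťoučký kůň",
        "url": "https://example.com/2022/06/very-long-article-slug?utm_source=feed&utm_medium=rss",
    }
    for name, t in samples.items():
        print(f"{name:6s}  chars={len(t):3d}  bytes={len(t.encode('utf-8')):3d}")
    # A character can span 1-4 bytes, so a byte offset can land mid-character:
    print("déçu".encode("utf-8")[:2].decode("utf-8", errors="replace"))  # 'd' + mangled 'é'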
Cheers, and "*I wished the world would give more worth to data*" too, Milos
Milos Jakubicek
CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.eu
On Mon, 20 Jun 2022 at 17:34, Ada Wan <adawan919@gmail.com> wrote:
Hi Christopher,
It is in the best interest of the community to discontinue the usage of "word". The term is not only very shaky in its foundation (if any), but it can also effect disparities in performance in computational processing and in robustness when human evaluation is involved. Although the term has been casually adopted by many in the past, like many un-PC terms that carry an inappropriate undertone, it needs to be discouraged and abandoned. Last but not least, I noticed that you are located in Canada; in the event that you were to work with any indigenous communities, one MUST be advised to be careful with the usage of such a term --- you could be imposing your own (EN- / FR- / dominant-language-centric) view onto another individual/community. There is an element of cultural and linguistic hegemony in the usage of such a term (including, and not limited to, making applications with it). Please also consult recent work in this area: https://openreview.net/forum?id=-llS6TiOew.
Feel free to get in touch if you should have any questions.
Best, Ada
On Mon, Jun 20, 2022 at 4:53 PM Christopher Collins <Christopher.Collins@ontariotechu.ca> wrote:
Hello,
I’m looking for any open source or cloud-hosted solution for complex word identification or word difficulty rating in French for a reading application.
As a backup plan we can use measures like corpus frequency, length, number of senses, but we’re hoping someone has already made a tool available.
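(Roughly what we have in mind for that fallback, just as a sketch: it assumes the wordfreq package and NLTK's Open Multilingual Wordnet data for French, and it is not a finished tool.)

    # pip install wordfreq nltk   (plus nltk.download('wordnet') and nltk.download('omw-1.4'))
    from wordfreq import zipf_frequency
    from nltk.corpus import wordnet as wn

    def difficulty_features(word: str) -> dict:
        """Crude per-word signals: rarer, longer, fewer-sense words are likely harder."""
        return {
            "zipf_freq": zipf_frequency(word, "fr"),        # lower = rarer = likely harder
            "length": len(word),
            "n_senses": len(wn.synsets(word, lang="fra")),  # via the French wordnet data
        }

    for w in ["chien", "paradoxalement", "ornithorynque"]:
        print(w, difficulty_features(w))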
We found this but that’s it: https://github.com/sheffieldnlp/cwi
Would appreciate any tips!
Thanks,
Chris
Christopher Collins [he/him https://medium.com/gender-inclusivit/why-i-put-pronouns-on-my-email-signature-and-linkedin-profile-and-you-should-too-d3dc942c8743]
Associate Professor - Faculty of Science
Canada Research Chair in Linguistic Information Visualization
Ontario Tech University
vialab.ca
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora Corpora mailing list -- corpora@list.elra.info To unsubscribe send an email to corpora-leave@list.elra.info