Hi Ada,
a very good paper (and a lot of work done - congrats!) and a very interesting thread.
Clearly linguistics as a field is terribly lacking a unified taxonomy (compared to biology, chemistry, whatever -- the difference is rather striking) and yes, this is causing serious trouble in NLP and in general (with linguists spending time promoting "their" definition instead of promoting harmonization regardless of their own preferences and traditions). But "word" here is not a special case -- it's just one of the many linguistic terms that are ill-defined.
When you say:
*It is of the best interest of the community to discontinue the usage of "word".*
you must realize this is never going to happen (and just results in getting an Orwellian hail; quite understandably, you'll not find a lot of support for prescriptive views on language on this list :). This is for both practical reasons ("word" is a word everyone is used to using; it is normal, and being normal is the strongest card you can play in language) and theoretical ones (the problem is not the string label but the concept itself, so changing the label is of no help here).
So, taking this practically:
- you may try to convince linguists to agree on some harmonized taxonomy (and I wish I knew how to be of any help here, but I don't think the majority sees it as a problem at all - they see it as "complexity/property of language" - so I don't believe this is going to happen anytime soon)
- you may promote awareness of how word processing impacts the performance and evaluation of NLP tools
The latter I think is much more useful and more likely to at least partially succeed -- and I was really happy to see that your paper quantifies that in state-of-the-art techniques (you may want to look at the 2016 PhD thesis of one of my ex-colleagues, who investigated character-based language modelling: https://is.muni.cz/th/en6ay/thesis.pdf). It is an omnipresent problem that actually starts with English and simple tasks like PoS tagging, in a slightly different way: the impact of not following exactly the tokenization that the tagger expects/was trained on (esp. for high-frequency items like "don't") is huge. In Sketch Engine, where we use tons of word-processing tools (mostly segmenters/stemmers/PoS taggers/lemmatizers), getting the input tokenization right is often the most difficult job; and of course, more so for languages where the word is more of an artificial and arbitrary concept. From a non-academic perspective, the issue is that users expect something familiar like words, and many are aware of the level of arbitrariness of "words" in their focus language.
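To make the "don't" point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny lexicon stands in for what a tagger saw in training (with "don't" split Penn Treebank style into "do" + "n't"), and the two tokenizers are toy functions, not the ones used in Sketch Engine or in any particular tagger.

```python
import re

# Hypothetical toy lexicon: the tokens (and tags) a tagger saw during training.
# The training data is assumed to split "don't" into "do" + "n't".
TRAINED_LEXICON = {"i": "PRP", "do": "VBP", "n't": "RB", "know": "VB", ".": "."}


def tokenize_whitespace(text):
    """Split on whitespace only, so "don't" and "know." each stay one token."""
    return text.lower().split()


def tokenize_treebank_like(text):
    """Split off "n't" and sentence-final punctuation (a rough approximation)."""
    text = re.sub(r"n't\b", " n't", text.lower())
    text = re.sub(r"([.!?])", r" \1", text)
    return text.split()


def unknown_rate(tokens):
    """Share of tokens that are missing from the tagger's training lexicon."""
    unknown = [t for t in tokens if t not in TRAINED_LEXICON]
    return len(unknown) / len(tokens)


sentence = "I don't know."
ws_tokens = tokenize_whitespace(sentence)
tb_tokens = tokenize_treebank_like(sentence)

print(ws_tokens, unknown_rate(ws_tokens))
print(tb_tokens, unknown_rate(tb_tokens))
```

With the mismatched (whitespace) tokenization, two of the three tokens are unknown to the toy lexicon; with the matching one, none are. A real tagger degrades less abruptly than a dictionary lookup, but the mechanism is the same.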
So, reading:
*Well, talk to the NLP crowd or the ones who expect LM/MT results from different languages should have different performances, even if/when all else were equal. (I remember how hard and how many rounds I had to work for my rebuttals....)*
Indeed, this is very much what everyone is used to :-/ From a purely technical perspective, switching to characters (or bytes -- which however are not a good level of abstraction in terms of interpretation, especially with variable-length encodings like UTF-8) is sometimes the right thing to do (though the figures easily get skewed -- esp. in a web corpus with plenty of long URLs). And sometimes the desired level of abstraction is the other way around, arriving at MWEs, which are even more of a nightmare than poor "words", whatever they represent. And sometimes the best way is to keep using "words" with lots of policing around what they are, which btw might very well be Christopher's case with French. Which way to go depends on the particular use case, so: is "word" a well-defined unit? Certainly not. Is it a useful one? Sometimes yes.
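A small Python sketch of the two caveats above, under assumed toy data: the Czech example word, the example.com URL and the helper names are all made up for illustration. It shows that character and byte counts diverge under UTF-8 as soon as non-ASCII letters appear, and that a single long URL pulls up the average token length.

```python
def char_and_byte_lengths(token):
    """Return (number of code points, number of UTF-8 bytes) for a token."""
    return len(token), len(token.encode("utf-8"))


def mean_chars_per_token(tokens):
    """Average token length in characters (code points)."""
    return sum(len(t) for t in tokens) / len(tokens)


def looks_like_url(token):
    """Very crude URL check, good enough for this illustration only."""
    return token.startswith(("http://", "https://"))


czech = "slovn\u00ed\u010dek"  # "slovníček", written with escapes to fix the encoding
print(char_and_byte_lengths("word"))  # ASCII only: both counts are equal
print(char_and_byte_lengths(czech))   # two 2-byte letters: more bytes than characters

# Hypothetical mini "web corpus" with one long URL in it.
tokens = ["a", "web", "corpus", "https://www.example.com/a/very/long/path?id=12345"]
without_urls = [t for t in tokens if not looks_like_url(t)]

print(mean_chars_per_token(tokens))
print(mean_chars_per_token(without_urls))
```

Whether URLs should be filtered out, kept, or replaced by a placeholder token is again a use-case decision; the sketch only shows why the raw figures should be read with care.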
Cheers and "*I wished the world would give more worth to data*"-too Milos
Milos Jakubicek
CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.eu
On Mon, 20 Jun 2022 at 17:34, Ada Wan adawan919@gmail.com wrote:
Hi Christopher,
It is of the best interest of the community to discontinue the usage of "word". The term is not only very shaky in its foundation (if any), but it can also effect disparity in performance in computational processing and robustness when human evaluation is involved. Although the term has been casually adopted by many in the past, like many un-PC terms that may have an inappropriate undertone, it needs to be discouraged and abandoned. Last but not least, I noticed that you are located in Canada; in the event that you were to work with any indigenous communities, you MUST be advised to be careful with the usage of such a term --- you could be imposing your own (EN- / FR- / dominant language-centric) view onto another individual/community. There is an element of cultural and linguistic hegemony in the usage of such a term (including but not limited to making applications with it). Please also consult recent work in this area: https://openreview.net/forum?id=-llS6TiOew.
Feel free to get in touch if you should have any questions.
Best, Ada
On Mon, Jun 20, 2022 at 4:53 PM Christopher Collins <Christopher.Collins@ontariotechu.ca> wrote:
Hello,
I’m looking for any open source or cloud-hosted solution for complex word identification or word difficulty rating in French for a reading application.
As a backup plan we can use measures like corpus frequency, length, number of senses, but we’re hoping someone has already made a tool available.
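As a rough illustration of such a backup measure, the Python sketch below combines corpus frequency and word length into one score. The scoring formula, the `length_weight` value and the toy French corpus are assumptions made up for this example, not an established CWI method; the number of senses is left out because it needs a lexical resource.

```python
import math
from collections import Counter


def build_frequencies(corpus_tokens):
    """Count lowercased tokens in a reference corpus."""
    return Counter(t.lower() for t in corpus_tokens)


def difficulty(word, freqs, total, length_weight=0.1):
    """Higher score = rarer and longer word.

    Uses add-one smoothing so unseen words get a finite score.
    The combination and the weight are arbitrary illustrative choices.
    """
    relative = (freqs.get(word.lower(), 0) + 1) / (total + 1)
    return -math.log(relative) + length_weight * len(word)


# Hypothetical toy corpus; a real setup would use a large French corpus.
corpus = "le chat dort sur le canapé et le chien dort aussi".split()
freqs = build_frequencies(corpus)
total = sum(freqs.values())

for w in ["le", "chien", "anticonstitutionnellement"]:
    print(w, round(difficulty(w, freqs, total), 2))
```

On this toy data the frequent, short "le" scores lowest and the unseen, long "anticonstitutionnellement" scores highest; any real use would need the weight tuned against human difficulty judgements.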
We found this but that’s it: https://github.com/sheffieldnlp/cwi
Would appreciate any tips!
Thanks,
Chris
*Christopher Collins* [he/him https://medium.com/gender-inclusivit/why-i-put-pronouns-on-my-email-signature-and-linkedin-profile-and-you-should-too-d3dc942c8743 ]
Associate Professor - Faculty of Science
Canada Research Chair in Linguistic Information Visualization
Ontario Tech University
vialab.ca
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list -- corpora@list.elra.info
To unsubscribe send an email to corpora-leave@list.elra.info