Hi Milos,
Pardon my late reply. I actually took the time and found some joy in reading Vít's dissertation (thanks for pointing me to it) --- a "kindred spirit" to my paper in its various versions (FaiR short https://openreview.net/forum?id=-llS6TiOew, R&B https://openreview.net/forum?id=dKwmCtp6YI, FaiR long https://drive.google.com/file/d/1eKbhdZkPJ0HgU1RsGXGFBPGameWIVdt9/view). I wish I had read his work earlier so I could have cited it! I found his relating of structure to segmentation (p. 7) insightful. But overall, just the willingness to move to the byte space for processing shows linguistic maturity, as opposed to those who insist on evaluating "language" based on "w*rds" and "sentences". (I wish people weren't so "hooked" on "w*rds"... it is just a different way of looking at things, no? When I first submitted my work to confs, I thought ppl would just get it as "so 'w*rds' are not so good, so let's switch", but instead I got totally beaten up (metaphorically) for it...) I think we could process in bytes, and "back-interpret" to chars.
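(For instance, a minimal sketch of what I mean by processing in bytes and back-interpreting to chars; purely illustrative Python, not from my paper:)

    # Work in byte space; recover characters only for human inspection.
    text = "naïveté"                      # any Unicode string
    b = text.encode("utf-8")              # byte-level view: 9 bytes for 7 chars here
    ids = list(b)                         # a model would consume these 0-255 integers
    # "back-interpretation": decode byte spans back to characters
    recovered = bytes(ids).decode("utf-8")
    assert recovered == text
    # a span may end inside a multi-byte character; decode leniently when inspecting
    partial = bytes(ids[:3]).decode("utf-8", errors="replace")   # 'na' + replacement char
    print(len(text), len(b), recovered, partial)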
I think I am on the same page with you on many things. E.g. re "w*rd": I absolutely agree that it is not the string label but the concept itself (de dicto vs de re --- fixing the former does not entail fixing the latter, the same way prejudices remain in humans despite all the "PC terms" etc. --- at the end of the day, it is our attitude towards everything that counts, not just what we say). As nomenclature for processing, however, I think we could use "token", where a token can be of any granularity, e.g. a "char token" as a char unigram. I didn't mean to "police" ppl's colloquial use of "w*rd" --- I used this form also to help raise awareness. There is some implicit linguistic hegemony, a colonial past, embodied in that term in the history of our field.
Re data: I wish we could have a data-centric statistical science for language! It would help update linguistics beyond the canonical structural-linguistics framework based on grammar, well-formedness etc. We are all data specialists in a way, and we are interested in data --- why not understand its statistical properties? Corpus linguistics has mostly been w*rd-based. Most practitioners at ML confs hold data constant to test algos; why can't there be an event where we hold algos constant and test data?
Thanks and best, Ada
p.s. to the community: sorry that my initial comment to this thread turned the thread into a discussion of my paper --- or so it seems. But I continue to welcome feedback on my work, as I am still working on an extended version of it (and please lmk if I should start another thread instead; the thread originator has been removed from this thread at his request!). I think, in general, the ELRA/corpora-list folks can be more experienced and sophisticated than most of the "MLNLP" folks, e.g. on the twittersphere. I would appreciate your input.
On Wed, Jun 22, 2022 at 1:42 PM Miloš Jakubíček <milos.jakubicek@sketchengine.eu> wrote:
Hi Ada,
a very good paper (and a lot of work done - congrats!) and a very interesting thread.
Clearly linguistics as a field is terribly lacking some unified taxonomy (compared to biology, chemistry, whatever -- the difference is rather striking) and yes, this is causing serious trouble in NLP and in general (with linguists spending time promoting "their" definition instead of promoting harmonization regardless of their own preferences and traditions). But "word" here is not a special case -- it's just one of the many linguistic terms that are ill-defined.
When you say:
*It is of the best interest of the community to discontinue the usage of "word".*
you must realize this is never going to happen (and just results in being hailed as Orwellian, quite understandably; you'll not find a lot of support for prescriptive views on language on this list :). This is for both practical reasons ("word" is a word everyone is used to using; it is normal, and being normal is the strongest card you can play in language) and theoretical ones (the problem is not the string label, but the concept itself, so changing the label is of no help here).
So, taking this practically:
- you may try to convince linguists to make some harmonized taxonomy (and I wish I knew how to be of any help here, but I don't think the majority sees it as a problem at all - they see it as a "complexity/property of language" - so I don't believe this is going to happen anytime soon)
- you may promote the awareness of how word-processing impacts the performance and evaluation of NLP tools
The latter, I think, is much more useful and more likely to at least partially succeed -- and I was really happy to see that your paper quantifies this for state-of-the-art techniques (you may have a look at the 2016 PhD thesis of one of my ex-colleagues, who investigated character-based language modelling: https://is.muni.cz/th/en6ay/thesis.pdf). It is an omnipresent problem that actually starts, in a slightly different way, with English and simple tasks like PoS tagging: the impact of not following exactly the same tokenization the tagger expects / was trained on (esp. for high-frequency items like "don't") is huge. In Sketch Engine, where we use tons of word-processing tools (mostly segmenters/stemmers/PoS taggers/lemmatizers), getting the input tokenization right is often the most difficult job; and of course, more so for languages where the word is more of an artificial and arbitrary concept. From a non-academic perspective, the issue is that users expect something familiar like words, and not many are aware of the level of arbitrariness of "words" in their focus language.
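(Just to make the mismatch concrete: a toy sketch, not our actual pipeline, and the tiny "lexicon" below is made up for illustration.)

    # A tagger trained on PTB-style tokenization vs. input split naively on whitespace.
    sentence = "I don't like it."

    penn_style = ["I", "do", "n't", "like", "it", "."]   # what the tagger was trained on
    naive_split = sentence.split()                       # ['I', "don't", 'like', 'it.']

    # Toy lexicon keyed by the training tokenization:
    lexicon = {"I": "PRP", "do": "VBP", "n't": "RB", "like": "VB", "it": "PRP", ".": "."}
    print([lexicon.get(t, "UNK") for t in penn_style])   # fully covered
    print([lexicon.get(t, "UNK") for t in naive_split])  # "don't" and "it." fall through as UNK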
So, reading:
*Well, talk to the NLP crowd or the ones who expect LM/MT results from different languages should have different performances, even if/when all else were equal. (I remember how hard and how many rounds I had to work for my rebuttals....)*
Indeed, this is very much what everyone is used to :-/ From a purely technical perspective, switching to characters (or bytes -- which, however, are not a good level of abstraction in terms of interpretation, especially with variable-length encodings like UTF-8) is sometimes the right thing to do (though the figures get easily skewed -- esp. in a web corpus with plenty of long URLs). And sometimes the desired level of abstraction goes the other way, arriving at MWEs, which are even more of a nightmare than poor "words", whatever they represent. And sometimes the best way is to keep using "words" with lots of policing around what they are, which btw might very well be Christopher's case with French. Which way to go depends on the particular use case, so: is "word" a well-defined unit? Certainly not. Is it a useful one? Sometimes yes.
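(A trivial illustration of both points: why byte counts don't line up with characters under UTF-8, and how web-corpus noise like long URLs skews character-level figures. Throwaway code; the examples are made up.)

    samples = {
        "french": "cœur déçu",
        "czech": "Příliš žluťoučký kůň",
        "url": "https://example.com/2022/06/very-long-article-slug?utm_source=feed&utm_medium=rss",
    }
    for name, t in samples.items():
        print(f"{name:6s}  chars={len(t):3d}  bytes={len(t.encode('utf-8')):3d}")
    # A character can span 1-4 bytes, so a byte offset can land mid-character:
    print("déçu".encode("utf-8")[:2].decode("utf-8", errors="replace"))  # 'd' + mangled 'é'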
Cheers, and "*I wished the world would give more worth to data*" too, Milos
Milos Jakubicek
CEO, Lexical Computing
Brno, CZ | Brighton, UK
http://www.lexicalcomputing.com
http://www.sketchengine.eu
On Mon, 20 Jun 2022 at 17:34, Ada Wan <adawan919@gmail.com> wrote:
Hi Christopher,
It is in the best interest of the community to discontinue the usage of "word". The term is not only very shaky in its foundation (if any), but it can also effect disparities in performance in computational processing and in robustness when human evaluation is involved. Although the term has been casually adopted by many in the past, like many un-PC terms that carry an inappropriate undertone, it needs to be discouraged and abandoned. Last but not least, I noticed that you are located in Canada; in the event that you were to work with any indigenous communities, one MUST be advised to be careful with the usage of such a term --- you could be imposing your own (EN- / FR- / dominant-language-centric) view onto another individual/community. There is an element of cultural and linguistic hegemony in the usage of such a term (including, and not limited to, making applications with it). Please also consult recent work in this area: https://openreview.net/forum?id=-llS6TiOew.
Feel free to get in touch if you should have any questions.
Best, Ada
On Mon, Jun 20, 2022 at 4:53 PM Christopher Collins <Christopher.Collins@ontariotechu.ca> wrote:
Hello,
I’m looking for any open source or cloud-hosted solution for complex word identification or word difficulty rating in French for a reading application.
As a backup plan we can use measures like corpus frequency, length, number of senses, but we’re hoping someone has already made a tool available.
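(Roughly what we have in mind for that fallback, just as a sketch: it assumes the wordfreq package and NLTK's Open Multilingual Wordnet data for French, and it is not a finished tool.)

    # pip install wordfreq nltk   (plus nltk.download('wordnet') and nltk.download('omw-1.4'))
    from wordfreq import zipf_frequency
    from nltk.corpus import wordnet as wn

    def difficulty_features(word: str) -> dict:
        """Crude per-word signals: rarer, longer, fewer-sense words are likely harder."""
        return {
            "zipf_freq": zipf_frequency(word, "fr"),        # lower = rarer = likely harder
            "length": len(word),
            "n_senses": len(wn.synsets(word, lang="fra")),  # via the French wordnet data
        }

    for w in ["chien", "paradoxalement", "ornithorynque"]:
        print(w, difficulty_features(w))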
We found this but that’s it: https://github.com/sheffieldnlp/cwi
Would appreciate any tips!
Thanks,
Chris
Christopher Collins [he/him https://medium.com/gender-inclusivit/why-i-put-pronouns-on-my-email-signature-and-linkedin-profile-and-you-should-too-d3dc942c8743]
Associate Professor - Faculty of Science
Canada Research Chair in Linguistic Information Visualization
Ontario Tech University
vialab.ca
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora Corpora mailing list -- corpora@list.elra.info To unsubscribe send an email to corpora-leave@list.elra.info