Hi Ada,
As these threads consist of a discussion rather than a set of scientific statements (the former being motivated by responding to a stimulus, while the latter consists of defining/motivating a scientific position that is supposed to stand apart from any specific discussion), I forbid you to use any of my writings made on the Corpora list on any of your websites.
Of course, I still authorise the Corpora list to keep its archives (as these are maintained along with the full discussion context).
Regards,
Gilles Sérasset,
On 31 Oct 2023, at 19:19, Ada Wan adawan919@gmail.com wrote:
Dear all
I am about to post Corpora List threads to which I have responded on my own website, as it seems some of my replies are not yet showing on the public website (https://list.elra.info/mailman3/hyperkitty/list/corpora@list.elra.info/). If any of you should have any objections to this (because you don't want your replies to be seen), please let me know asap.
Thanks and best,
Ada
On Mon, Oct 30, 2023 at 9:31 PM Ada Wan <adawan919@gmail.com> wrote:
[Disregard if not interested]
Dear all
Thanks for your emails. The issue of where the misunderstanding might lie is clearer to me now, esp. given Gilles' example with his niece. (@Anil: perhaps you are right in your observation of a possible style change in my correspondence --- I may well have been running out of patience at this point, considering I have been in rebuttal mode since at least 2019 [1]?! So it's a good thing that morphology is coming to an end! In the beginning, I had expected the professionals experienced in "language"/data matters (i.e. the subscribers of the Corpora List) to be the first to appreciate my results, but it seems to have turned out the other way around. Those who have been exposed to fewer "language tales" [2] can be quicker in getting it. But anyway, please allow me to explain again below.)
Most importantly, in the niece example, there are two things that should be distinguished from one another: i. what the niece uttered [i.e. the data/observation (note also how the data was collected: recorded or transcribed?)], and ii. one's interpretation/analysis of her utterance [i.e. the interpretation/analysis of the observation].
In "grammarese" formulation, the case in question is as follows: Gilles' niece conjugated an irregular verb with a regular verb conjugation pattern.[3]
Gilles suspects that (linguistic) morphology exists (and/or is universal?) because the pattern in his niece's utterance resembled one of the patterns (sometimes formulated as "rules" [4]) often studied in the literature on morphology.
Re "she clearly showed me that her way of learning languages did not consisted in reading/listening to huge amounts of utterances ...": even if the niece had only been exposed to 10 utterances, if 8 of which exhibit a certain pattern, and 2 of which are more irregular/outlier-like, chances of her applying habits that are in line with the pattern observed more often in the rarer/unobserved cases can be high --- and would you not agree that's rather reasonable? There are or may be un-/subconscious *patterns*, sure. But I do not argue against these, for such patterns do not have to be formulated in terms of "stems"/"roots"/"affixes", and more importantly, most of these patterns surface more often in books than in real life anyway. So the fact that one believes that a morphological paradigm is to be formulated in a certain way is pretty much a matter of preference of a (group of) researcher(s).
Re "but she was able to learn some word formation rules from very few examples": what she "learned" might just be some patterns --- at least according to your/our analysis here. That is, she might not have yet had much exposure to "rules", but Gilles might have. (Hence his conviction of the reality of morphology may be stronger.)
Re "In my humble opinion, this proves that morphology exists, if not in the LLM matrixes, at least in the human brain": I don't disagree with how one's mind can be clouded by archaic ideals or theories. But shouldn't a better theory exist outside of the mind of a person or a group of scientists as well?
If one accounts for text data in its entirety, i.e. without disregarding or adding whitespaces, and evaluates over bigger spans (as mentioned in the rebuttal here [5]), the notion of morphology is actually irrelevant to a comprehensive study of (language) data. Wouldn't you agree? As for your plane and bird analogy: you could claim that even if you do insist on cherry-picking from the data, your analyses should still matter. Well, if they don't generalize well, they may end up mattering only to you.
Re "... (or issued from a colonialist point of view of Aves on the task at hand…) and asking them to renounce this oh so obsolete bad habit": I suppose it depends on which side of history one would like to be on too.
I understand that it can be much harder for those who have lived in a country where "language" activities (and/or the concept of "language") have been officially and explicitly supported/promoted. This "privilege" now puts many of us in a rather disadvantageous position when it comes to unlearning much of it.
Re "ML based language models": I don't know what you understand of these, but the logic behind such (e.g. a probabilistic processing/interpretation of sequences) is often not far from how "humans" are known to "process language(s)" --- which is why many modeling experiments can bridge "both spheres" (though I believe many experienced in modeling would buy less into this "human 'versus' machine" narrative).
@Gilles: I am also curious what your takeaway is from Quine's "Word and Object" (e.g. at https://mitpress.mit.edu/9780262670012/word-and-object/) in relation to our conversation here.
@Anil: the computational phenomenology is already in "Fairness in Representation" (note that the insights were obtained from a collection of many models, i.e. most of them are epiphenomena). So I think what I have in mind is orthogonal to what you described. Crimes and other misconduct have also been around for millennia; are these things we want to keep? That having been clarified, do you have other objections to my contributions?
I hope I have addressed your concerns sufficiently. If not, please let me know.
Thanks and best,
Ada
[1] The results that ended up getting published in "Fairness in Representation" (https://openreview.net/forum?id=-llS6TiOew, ICLR 2022) had been rejected about 5 times, and those in "Statistical (Un-)typology" (even with "greedy" research incentives so as to fit in) about another 5 times from May 2019 to April 2022, in addition to other attempts/withdrawals. Since then, all I have been dealing with is retaliation. In fact, I just got some stuff stolen and had to report it to the police, so please pardon my delay in replying.
[2] At one point, I thought perhaps it'd be best to have no disciplines. Then I realized not all disciplines are like "language", "linguistics", or "structural linguistics". That having been expressed, can having "no disciplines" still be a good thing? Possibly, but that's another debate, for another time, perhaps.
[3] But let's bear in mind: what one would consider a "regular verb" (vs an "irregular verb") is nothing but some sequence/utterance seen/heard more frequently than others.
[4] Esp. in the history of "transformational grammar", which was popular around the mid-20th century. "Grammar rules" might have been around for longer, but branding things as within the domain of "morphology", as a module of a bigger "structure"/"structural framework" of "linguistic analysis", has become popular only in the past half century or so due to "transformational grammar" / "structural linguistics". But please do note that even in "structural linguistics", many patterns are explained away in terms of (the ranking of) constraints (i.e. no "transformation"). There are no/few reasons to posit the notion of "deep structure(s)", from/through which, in the case of morphological analyses, "stems"/"roots" often get held up as the bases of inflection. That is, aside from the "grammar rules" taught in e.g. schools and those inside researchers' minds, evidence for the existence of "rules" is actually rather scant, if there is any at all. [N.B. this can be considered advanced for those who don't have a theoretical background in Linguistics.]
[5] https://openreview.net/forum?id=-llS6TiOew
On Thu, Oct 26, 2023 at 6:05 PM Anil Singh <anil.phdcl@gmail.com> wrote:
I have also been carefully reading the exchanges. Although I was planning not to add to this exchange, at this point I am tempted to reply.
Ada's early emails were adding something to the discussion and debate, but at this point they are simply saying 'I am right, you are wrong', without giving any explanation or evidence.
I was also thinking of the same kind of examples as the one given by Gilles. Till Ada provides some very good reasoning and evidence, it is hard for me to completely agree with her, although, as I said earlier, I do agree with her on many, perhaps most, things.
Ada, I sincerely respect your learning and competence. However, you said earlier that you are proposing an alternative computational phenomenology. That would be really interesting. Wouldn't it be better to first propose it and argue, in more specific terms and with more convincing arguments and evidence, that it is the right one, or at least 'more right' than the existing ones (there is more than one)? Given that there is already Information Theory, it has to go beyond the byte, which is an accidental unit of computation, and the character, which is also not well-defined, sometimes even for one specific writing system.

To give one such example, perhaps not the best one: I always thought of the Indic-script dependent vowel (maatraa) as a character, but I recently found that languages like Java and Python do not treat such written symbols as characters, so when I try to get the length of an Indic-script string, the built-in string length functions give only the number of consonant symbols and independent vowels in the string. We got wrong results using these functions, and I only accidentally discovered that this was the case. The reason, of course, is that these functions and programming languages treat such dependent vowels as diacritics, which is also correct in some ways. I did not realize this earlier because in India we often use a Latin-script-based notation called WX for Indic scripts in NLP, due to the encoding and input method related problems that I referred to in one of my earlier replies. The WX notation, however, does not distinguish between dependent and independent vowels and treats both of them as the same character, which is how most of us, if not all, think of them in India, to the best of my knowledge. On the other hand, the consonant modifier 'halant' is not used in WX but is used in Indic scripts, and its presence might also cause disagreements about what the string length is.

In other words, the character as a unit does not work in your terms. In fact, who knows how many errors for Indic-script text have made their way into computational results due to this simple fact. And perhaps they still do, because it took me a long time to realize this, which at first led to consternation: in text processing, if you can't rely on the string length function, what can you rely on?
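To make the ambiguity concrete, here is a minimal Python 3 sketch (just an illustration added here, not part of the original exchange, and assuming the third-party `regex` package is installed) showing how the same Devanagari syllable yields different counts depending on the unit chosen: UTF-8 bytes, Unicode code points, or user-perceived grapheme clusters.

    import unicodedata
    import regex  # third-party package, assumed available: pip install regex

    # The Devanagari syllable "की" = KA (U+0915) + dependent vowel sign II (U+0940)
    s = "\u0915\u0940"

    print(len(s.encode("utf-8")))          # 6  -> UTF-8 bytes
    print(len(s))                          # 2  -> Unicode code points (Python's built-in len)
    print(len(regex.findall(r"\X", s)))    # 1  -> user-perceived grapheme clusters
    print(unicodedata.category("\u0940"))  # 'Mc' -> the maatraa is a (spacing) combining mark

Whether the "right" length here is 1, 2, or 6 is exactly the kind of question a computational phenomenology would have to settle before byte- or character-level claims can be compared.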
As for phonemes, major ML researchers like Vincent Ng don't believe them to be a real unit of language. The argument is that we don't need phonemes for applications like speech recognition.
If not the byte and the character, what are we left with in terms of a computational phenomenology? At the very least there has to be such a well-argued and well-evidenced alternative in order to persuade others to agree with your views. I would be very much interested in thinking about such an alternative, even if at present I don't think you are right about all of your views. After all, to throw away millennia of work on language science, very strong reasoning and evidence for an alternative is not an unrealistic expectation.
On Thu, Oct 26, 2023 at 8:44 PM Gilles Sérasset via Corpora <corpora@list.elra.info> wrote:
Hi Ada,
When my niece was 3 years old, she said to her little brother “Maman, elle venira plus tard…” (Mum will come back later, in “incorrect” French).
She made a “mistake” here by using “venira” (a wrong future form of the verb venir (to come)) instead of the “correct” “viendra”. It was wrong, but perfectly predictable using the most productive morphological rules of French future formation.
She was 3 years old, so I doubt she really understood what morphology is; nevertheless, with this mistake she clearly showed me that her way of learning languages did not consist of reading/listening to huge amounts of utterances but she was able to learn some word formation rules from very few examples. And indeed, humans are still able to perfectly learn complex things from very little explanation and/or very few examples (something that is totally beyond ML-based language models).
In my humble opinion, this proves that morphology exists, if not in the LLM matrixes, at least in the human brain. Hence modelling such rules (and even using them to analyse or produce language) is a valid approach, independent of any other (also valid) approaches.
To put it another way:
There have been many scientific proofs that humans would never be able to fly… and these proofs were valid under their own hypotheses.
Indeed, planes do not flap their wings… they use other means to perform a task that was performed by birds.
Nevertheless, I have never witnessed any plane (or pilot) trying to convince birds that their way of flying is obsolete (or issued from a colonialist point of view of Aves on the task at hand…) and asking them to renounce this oh so obsolete bad habit.
Regards,
Gilles,
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-leave@list.elra.info
--
- Anil