I have also been carefully reading the exchanges. Although I was planning not to add to this exchange, at this point I am tempted to reply.
Ada's early emails were adding something to the discussion and debate, but at this point they are simply saying 'I am right, you are wrong', without giving any explanation or evidence.
I was also thinking of the same kind of examples as those given by Gilles. Until Ada provides some very good reasoning and evidence, it is hard for me to completely agree with her, although, as I said earlier, I do agree with her on many, perhaps most, things.
Ada, I sincerely respect your learning and competence. However, you said earlier that you are proposing an alternative computational phenomenology. That would be really interesting. Wouldn't it be better to first propose it and then argue, in more specific terms and with more convincing arguments and evidence, that it is the right one, or at least 'more right' than the existing ones (there is more than one)? Given that there is already Information Theory, it has to go beyond the byte, which is an accidental unit of computation, and the character, which is also not well-defined, sometimes even within one specific writing system.
To give one such example, perhaps not the best one: I always thought of an Indic-script dependent vowel (maatraa) as a character, but I recently found that languages like Java and Python do not treat such written symbols as characters, so when I try to get the length of an Indic-script string, the built-in string length functions give only the number of consonant symbols and independent vowels in the string. We got wrong results using these functions, and I discovered this only by accident. The reason, of course, is that these functions and programming languages treat such dependent vowels as diacritics, which is also correct in some ways.
I did not realize this earlier because in India we often use a Latin-script-based notation called WX for Indic scripts in NLP, due to the encoding and input-method problems that I referred to in one of my earlier replies. The WX notation, however, does not distinguish between dependent and independent vowels and treats both as the same character, which, to the best of my knowledge, is how most of us in India, if not all, think of them. On the other hand, the consonant modifier 'halant' is not used in WX but is used in Indic scripts, and its presence might also cause disagreements about what the string length is. In other words, the character as a unit does not work in your terms.
In fact, who knows how many errors in Indic-script text processing have made their way into computational results because of this simple fact? And perhaps they still do. It took me a long time to realize this, and at first it led to consternation, because in text processing, if you can't rely on the string length function, what can you rely on?
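To make the string-length point concrete, here is a minimal sketch, assuming Python 3 and its standard unicodedata module. The function names and the sample word are illustrative assumptions, not the code that produced the wrong results mentioned above. It shows two of the possible answers: len() counts Unicode code points, so each maatraa and the halant count as one, whereas a count that skips combining marks (general categories starting with "M") yields only the consonants and independent vowels.

```python
import unicodedata

# Sample word (an assumption for illustration): Hindi "हिन्दी", written as
# escapes so the code points are unambiguous:
# HA, VOWEL SIGN I, NA, VIRAMA (halant), DA, VOWEL SIGN II
WORD = "\u0939\u093f\u0928\u094d\u0926\u0940"


def count_code_points(text: str) -> int:
    """Length as Python's len() sees it: one per Unicode code point."""
    return len(text)


def count_base_letters(text: str) -> int:
    """Count only code points that are not combining marks.

    Dependent vowel signs and the halant have general category Mc or Mn,
    so they are skipped here.
    """
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    # Show each code point with its Unicode category and name.
    for ch in WORD:
        print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
    print("code points: ", count_code_points(WORD))
    print("base letters:", count_base_letters(WORD))
```

For this word the first count is 6 and the second is 3. Neither is "the" length: counting orthographic syllables or grapheme clusters would be yet another definition, which is exactly where disagreements about string length come from.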
As for phonemes, major ML researchers like Vincent Ng don't believe the phoneme to be a real unit of language. The argument is that we don't need phonemes for applications like speech recognition.
If not the byte and the character, what are we left with in terms of computational phenomenology? At the very least, there has to be a well-argued and well-evidenced alternative if you want to persuade others to agree with your views. I would be very much interested in thinking about such an alternative, even if at present I don't think you are right about all your views. After all, if we are to throw away millennia of work on language science, expecting very strong reasoning and evidence for an alternative is not unrealistic.
On Thu, Oct 26, 2023 at 8:44 PM Gilles Sérasset via Corpora <corpora@list.elra.info> wrote:
Hi Ada,
When my niece was 3 years old, she said to her little brother “Maman, elle venira plus tard…” (Mum will come back later, in “incorrect” French).
She made a “mistake” here by using “venira” (a wrong future form of the verb venir (to come)) instead of the “correct” “viendra”. It was wrong, but perfectly predictable using the most productive morphological rules of French future formation.
She was 3 years old, so I doubt she really understood what morphology is. Nevertheless, with this mistake she clearly showed me that her way of learning language did not consist of reading or listening to huge amounts of utterances; she was able to learn some word-formation rules from very few examples. And indeed, humans are still able to learn complex things perfectly from very little explanation and/or very few examples (something that is totally beyond ML-based language models).
In my humble opinion, this proves that morphology exists, if not in the LLM matrices, at least in the human brain. Hence modelling such rules (and even using them to analyse or produce) is a valid approach, independently of any other (also valid) approaches.
To put it another way:
There have been many scientific proofs that humans would never be able to fly… And these proofs were valid under their own hypotheses.
Indeed, planes do not flap their wings… they are using other ways to perform a task that was performed by birds.
Nevertheless, I have never witnessed any plane (or pilot) trying to convince birds that their way of flying is obsolete (or stems from a colonialist point of view of Aves on the task at hand…) and asking them to renounce this oh-so-obsolete bad habit.
Regards,
Gilles,