[Corpora-List] Re: Lemmas and Lemmatization [was Re: NIF: NLP Interchange Format]

26 Oct 2023


I have also been carefully reading the exchanges. Although I was planning
not to add to this exchange, at this point I am tempted to reply.
Ada's early emails were adding something to the discussion and debate, but
at this point they are simply saying 'I am right, you are wrong', without
giving any explanation or evidence.
I was also thinking of the same kind of examples as given by Gilles. Until
Ada provides some very good reasoning and evidence, it is hard for me to
completely agree with her, although as I said earlier, I do agree with her
on many, perhaps most, things.
Ada, I sincerely respect your learning and competence. However, you said
earlier you are proposing an alternative computational phenomenology. That
would be really interesting. Wouldn't it be better to first propose it and
argue in more specific terms, with more convincing arguments and
evidence, that it is the right one, or at least 'more right' than the
existing ones (there is more than one)? Given that there is already
Information Theory, it has to go beyond the byte, which is an accidental
unit of computation, and the character, which is also not well-defined,
sometimes even for one specific writing system. To give one such example,
perhaps not the best one, I always thought of an Indic-script dependent
vowel sign (maatraa) as a character, but I recently found that languages
like Java and Python do not treat such written symbols as characters, so
when I try to get the length of an Indic-script string, the built-in
string length functions give only the number of consonant symbols and
independent vowels in the string.
We got wrong results using these functions and I only accidentally
discovered that this is the case. The reason, of course, is that these
functions and programming languages treat such dependent vowels as
diacritics, which is also correct in some ways. I did not realize this
earlier because in India we often use a Latin script-based notation called
WX for Indic scripts in NLP due to the encoding and input method related
problems that I referred to in one of my earlier replies. The WX notation,
however, does not distinguish between dependent and independent vowels and
treats both of them as the same character, which is how most of us, if not
all, think of them in India to the best of my knowledge. On the other hand,
the consonant symbol modifier 'halant' is not used in WX, but is used in
Indic scripts, and its presence might also cause disagreements about what
the string length is. In other words, the character as a unit does not
work in your terms. In fact, who knows how many errors for Indic-script
text have made their way into computational results due to this simple
fact? And perhaps they still do, because it took me a long time to realize
this, which at first led to consternation: in text processing, if you
can't rely on the string length function, what can you rely on?
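To make the string-length point concrete, here is a minimal sketch of how
one Devanagari word can have several defensible 'lengths'. It assumes
Python 3, where len() counts Unicode code points, so dependent vowel signs
and the halant are each counted; restricting the count to letters (Unicode
category Lo) leaves only the consonants and independent vowels. The name
length_report and the sample word are illustrative assumptions, not part of
the discussion above, and other languages may count differently (Java's
String.length(), for example, counts UTF-16 code units).

```python
import unicodedata


def length_report(text):
    """Return several different 'lengths' of the same string."""
    categories = [unicodedata.category(ch) for ch in text]
    return {
        # What Python 3's built-in len() counts: Unicode code points.
        "code_points": len(text),
        # Letters only (category Lo): consonants and independent vowels.
        "letters": sum(1 for c in categories if c == "Lo"),
        # Combining marks (Mn, Mc): dependent vowel signs, halant, etc.
        "marks": sum(1 for c in categories if c.startswith("M")),
        # Size of the UTF-8 encoding in bytes.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# The word "Hindi" in Devanagari, spelled out as escapes:
# ha, vowel sign i, na, halant (virama), da, vowel sign ii
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

print(length_report(word))
```

For this word the four counts all differ from what a reader might call the
number of characters, which is why the choice of unit has to be stated
explicitly before string lengths are compared.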
As for phonemes, major ML researchers like Vincent Ng don't believe the
phoneme to be a real unit of language. The argument is that we don't need
phonemes for applications like speech recognition.
If not the byte and the character, what are we left with in terms of
computational phenomenology? At the very least there has to be such a
well-argued and well-evidenced alternative in order to try to persuade
others to agree with your views. I would be very much interested in
thinking about such an alternative, even if at present I don't think you
are right about all your views. After all, if we are to throw away
millennia of work on language science, expecting very strong reasoning and
evidence for an alternative is not unrealistic.
On Thu, Oct 26, 2023 at 8:44 PM Gilles Sérasset via Corpora <
corpora@list.elra.info> wrote:
...
Hi Ada,
When my niece was 3 years old, she said to her little brother “Maman, elle
venira plus tard…” (Mum will come back later, in “incorrect” French).
She made a “mistake” here by using “venira” (a wrong future form of the
verb venir (to come)) instead of the “correct” “viendra”. It was wrong, but
perfectly predictable using the most productive morphological rules of
French future formation.
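The overregularization described above can be sketched as a productive rule
plus a small exception list. This is a toy illustration under stated
assumptions, not a model of French morphology: the function names and the
four-entry table of irregular stems are hypothetical and chosen only for
this example.

```python
# Hypothetical, deliberately tiny table of irregular future stems.
IRREGULAR_FUTURE_STEMS = {
    "venir": "viendr",
    "avoir": "aur",
    "aller": "ir",
    "faire": "fer",
}


def regular_future_3sg(infinitive):
    """Apply only the productive rule: infinitive (minus a final -e) + -a."""
    stem = infinitive[:-1] if infinitive.endswith("re") else infinitive
    return stem + "a"


def lexical_future_3sg(infinitive):
    """Use a stored irregular stem if there is one, else the productive rule."""
    stem = IRREGULAR_FUTURE_STEMS.get(infinitive)
    if stem is None:
        return regular_future_3sg(infinitive)
    return stem + "a"


print(regular_future_3sg("venir"))   # the child's form
print(lexical_future_3sg("venir"))   # the form with the irregular stem
print(lexical_future_3sg("parler"))  # a regular verb goes through the rule
```

The child's form is exactly what the rule alone produces; the "correct"
form needs the extra stored exception.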
She was 3 years old, so I doubt she really understood what morphology is.
Nevertheless, with this mistake she clearly showed me that her way of
learning languages did not consist of reading/listening to huge amounts of
utterances; rather, she was able to learn some word formation rules from
very few examples. And indeed, humans are still able to learn complex
things perfectly from very little explanation and/or very few examples
(something that is totally beyond ML-based language models).
In my humble opinion, this proves that morphology exists, if not in the
LLM matrices, at least in the human brain. Hence modelling such rules (and
even using them to analyse or produce) is a valid approach, independently
of any other (also valid) approaches.
To put it another way:
There have been many scientific proofs that humans would never be able to
fly… And these proofs were valid under their own hypotheses.
Indeed, planes do not flap their wings… they use other means to
perform a task that was performed by birds.
Nevertheless, I have never witnessed any plane (or pilot) trying
to convince birds that their way of flying is obsolete (or stems from a
colonialist point of view of Aves on the task at hand…) and asking them to
renounce this oh-so-obsolete bad habit.
Regards,
Gilles,

Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-leave@list.elra.info
-- 
- Anil
