[Corpora-List] Re: RANLP 2023 Call for Participation

30 Aug 2023

Dear All,
I agree with Edyta's polite remarks.
I find the discussions posted under purely informative posts quite confusing, and
I am "losing track" of the original posts to the point that I fear I might
miss calls that could be relevant for my work, or miss discussions that are
worth joining. Before Edyta's remarks I was even considering leaving the
list because of the current situation.
So, I join Edyta's kind request to keep discussions in separate threads and
leave calls for papers/abstracts and job calls as purely informative posts.
Perhaps opening a new, separate discussion thread would be an
option that allows us to filter the different kinds of communications
we receive from the list.
Best wishes to everyone,
Daniela Cesiri
On Wed, 30 Aug 2023, 17:15, Edyta Jurkiewicz-Rohrbacher via Corpora <
corpora@list.elra.info> wrote:
...
Dear Ada, dear all,
I'm a bit concerned about what has been going on with the list recently.
The list, as far as I understand, serves several purposes. One of them
is purely informative, where one informs the community about
potentially interesting jobs, conferences etc. If I open an answer to
a job advertisement, I expect it to be a question useful for
potential applicants or a correction about, for example, deadlines.
Another purpose is to ask questions or start discussions on
various topics, either theoretical or purely practical. There I
expect people to share their experience and opinions.
What I do not find OK is giving feedback on purely informational
posts in the way Ada does. In my opinion, discussions about whether words
or sentences are up-to-date concepts in any (general) linguistic or
computational linguistic framework should be held in separate threads.
(Notice also that the problem of text segmentation has been a topic for
a long time already.) I wouldn't mind if Ada's comments were
presented privately to the authors of the posts, or discussed in
separate list mails. Otherwise, we are facing chaos here.
Summing up, I would be more than happy to participate if discussions
about the relation between linguistics and NLP took place, but not
mixed with advertisements.
I hope I did not offend anybody with this message.
Best,
Edyta Jurkiewicz-Rohrbacher
On Wed, 30 Aug 2023 at 16:35, Gilles Sérasset via Corpora
<corpora@list.elra.info> wrote:
...
Dear Ada, dear all,
I am not a linguist but a computer scientist who is quite used to
talking with (and trying to understand) linguists. I must say that I usually
read your mails as thoroughly as my schedule and patience allow me to,
but, to be honest, I also have a rather negative feeling when reading your
“discourse”.
...
In this discourse, I see facts + interpretation + rhetoric.
[Here I take the risk of caricaturing for the sake of brevity. I hope
you will understand that I have neither the time nor the intention to go deeply
into all the intricacies of your different claims, as I am more a witness to
than an actor in this scientific dispute.]
...
My understanding of your facts: Neural models do not use the concept of
word in any of their tasks, but achieve very interesting results in their
modelling of the language.
...
My understanding of your interpretation: this is the proof that there is
no such thing as a word.
...
My understanding of your rhetoric: linguists are still using “words”, so
they are wrong or dishonest or miseducated or dumb; we should wipe out
entirely any occurrence of this concept and start over with another
model of language.
...
Please understand that I am just presenting the way I interpret
your different messages. And even if I am wrong here, this interpretation
has to be taken into account, as we are all people with feelings. This
feeling is a fact, even though I do not particularly feel targeted by your
different criticisms. I hope this will help you weigh the terms you use
in your next messages.
...
This being said, I was not particularly surprised to see some
“passionate” replies to your different messages. And I agree with everyone
here that we should not give in to such passion and use ad hominem attacks on a
mailing list, AND you should also understand that most of your rhetoric does
contain such passion and attacks.
...
Concerning the facts:
You are right: neural models do not use any notion of word (or word
morphology) as it is usually understood in linguistics. Such a model usually first
decides the granularity at which it will aggregate its input
(a sequence of characters) into tokens, to each of which it attaches an
“interpretation” (modelled as a multi-dimensional vector).
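The aggregation described above can be illustrated with a minimal sketch. It assumes a greedy longest-match segmentation over a small, entirely hypothetical vocabulary and random stand-in vectors. Real neural models learn both the vocabulary and the vectors from data, so this shows only the general shape of the idea, not any particular system.

```python
import random

# Hypothetical subword vocabulary, chosen by hand for illustration only.
VOCAB = ["un", "break", "able", "token", "s", " "]


def tokenize(text, vocab):
    """Greedy longest-match segmentation of a character sequence.

    Characters not covered by the vocabulary become single-character tokens.
    """
    tokens = []
    max_len = max(len(entry) for entry in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def embedding_table(tokens, dim=4, seed=0):
    """Attach a stand-in vector to each distinct token (random, not learned)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for tok in sorted(set(tokens))}


tokens = tokenize("unbreakable tokens", VOCAB)
vectors = embedding_table(tokens)
print(tokens)
print(len(vectors["break"]))
```

Note that none of the resulting tokens needs to coincide with a word in the linguistic sense: "unbreakable" is split into three pieces, and the space is a token of its own.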
...
Concerning the interpretation:
You want to wipe out the notion of word based on such a fact. I would
agree somehow if we were dealing with a universal model of language,
but this is not the case. Humans model language in a certain way and neural
models in another way (even if neural networks are claimed to be inspired
by the biological neurons in our brains). The fact that a concept does not
exist in one model does not entail that it does not exist in another model.
...

Also, you make the very same mistake in the way you look
at the facts: there is no such thing as a character either, which means that
the input of a NN is already flawed by a bias in the way we look at
language. Indeed, characters are a very recent invention that builds on
different concerns:
- the usual graphical elements that are traditionally used in writing
  a language and that have been interpreted as atomic,
- their interpretation by the encoding authorities (see the differences
  and debates about code points vs characters),
- arbitrary decisions (e.g. why model A and a as 2 different
  characters?)
...
Moreover, corpora are usually badly encoded, with one character used
for another (a quote instead of an apostrophe, a non-breaking space instead of
a space, …), and this only concerns languages with a writing system or
transcription, i.e. not the majority of them.
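The points about code points, case and look-alike characters can be made concrete with a short sketch using Python's standard `unicodedata` module. The sample strings are chosen purely for illustration; the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(text):
    """List each code point in `text` with its official Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# One "character" for a reader, but two possible encodings:
composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # letter e followed by a combining acute accent
print(len(composed), len(decomposed))
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# Look-alikes that are often used one for another in corpora:
print(describe("'\u2019"))   # typewriter apostrophe vs. typographic quote
print(describe(" \u00a0"))   # ordinary space vs. non-breaking space

# "A" and "a" are distinct code points; treating them alike is a modelling choice.
print("A" == "a", "A".casefold() == "a".casefold())
```

Whether such variants are normalised before a text reaches a model is itself a decision about what counts as "the same character".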
...
The conclusion is that even neural networks carry an artificial bias in the
way they model language, which means that the conclusions we draw from them
are as flawed as the ones we draw from the classical way linguists look at
languages.
...

Most serious linguists never defined “words” lightly, and most of them
know that this concept is an “approximation” of something that is very
difficult to apprehend and that seems to be grounded more in linguistics based on
human introspection than in linguistics based on corpora. It somehow represents
the way our human brain aggregates the atoms of the language
(characters/phonemes) into something with which we associate an
interpretation. In this sense, words are somehow the “tokens” of our biological
neural network (and certainly far more).
...
As the production of an utterance is not a bijection between whatever we have
in our head and the sequential signal we use to communicate, I agree with
you on the fact that “words” are certainly not present in a corpus (but I
do think that our inner “tokens” may somehow be observed there).
...
Concerning the rhetoric:
I do not think any linguist or computational linguist is naive enough to
think that any of the models we deal with is a “truth”, and I doubt any
of them is miseducated enough to think that “words” are clearly defined and
undoubtedly present in corpora. I do think, though, that they are usually
right to observe occurrences (or hints) of non-atomic constructs that we
associate with some interpretation. I also think that this way of looking
at a corpus has some advantages that are not really present in NNs (for
instance, it can reveal regularities that help humans produce new
utterances without being shown a large number of examples).
...
I also think that, even if you were totally right in your facts and
interpretations, asking for a rejection of current/past ways of looking at
texts would be a mistake. Even in physics, since the general theory of
relativity, we have known that classical mechanics is wrong; however, it is still
in use, and this is not a problem as long as everybody knows under which
hypotheses it is a good enough approximation and under which hypotheses it
no longer works.
...
I know this message will certainly not make you think differently, but
if it allows you to communicate differently with people who still use the
terms “words” or “sentences” as a simple shortcut to position their work
within a shared/common understanding of the state of the art, in contexts
where there is no room for better explanation (e.g. in summaries of their
keynote speeches), then I will have achieved something.
...
Hoping this scientific debate will continue in a calmer manner,
Regards,
Gilles Sérasset

Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-leave@list.elra.info
