RANLP 2023 Call for Participation
We are pleased to share the programme of the international conference 'Recent Advances in Natural Language Processing' (RANLP'2023). To view the programme, please click here https://ranlp.org/ranlp2023/index.php/main-conference-programme/
To register, please visit https://ranlp.org/ranlp2023/index.php/fees-registration/
We very much hope to welcome you at RANLP'2023 in Varna!
Dear RANLP organizers
I looked through your program and have a few questions:
i. I noticed there is a parallel session for "Sentence-level Representation and Analysis". Would you mind please letting me (or us all on this list) know why "sentence(s)" would be relevant and necessary in computing? Does the term "sentence" refer to line (as delimited by line breaks)? Or are you segmenting via some heuristics, e.g. with some punctuation indicators, for each dataset --- if so, would there not be a concern on grounds of fairness and diversity as well as on robustness, sufficiency, and applicability? [I also understand that in NLP there has been (an undue) grammarian influence, leading to the false assumption that processing/evaluation based on "sentences" would be necessary (when one can do so based on a wider span of text instead). As neither machines nor humans need the concept of "sentence" to produce/understand language, I wonder: wouldn't the restriction to "sentence"-level analyses lead to overfitting?]
ii. Re "multilinguality": would that be a session in which work would focus on computationally relevant topics such as character encoding (see https://openreview.net/forum?id=-llS6TiOew)? Or would it further perpetuate the adverse effects of grammatical teachings and concepts?
iii. Re Isabelle's keynote: I hope her social scientific studies would be ones leveraging comprehensive data (i.e. not ones with selection bias) and rigorous statistical testing. There are many ethical aspects of experimentation in the social sciences, including but not limited to hypothesis formulation, that one should be very careful about. Otherwise, the study could be interpreted as sentiment/identity manipulation (e.g. your formulation with "origin" [1]). There has been some work in the CL/NLP space that touches on identity politics in ways that may be unnecessary or inappropriate, hence my remark here.
iv. Re Ed's keynote on "neuro-symbolic approaches": I previously replied on this mailing list, on 16 Jul 2023, to Alexander Koller's call on such for his DFG project, as follows: "As we know, neural models are statistical models in nature. Symbolic representations could create/reinforce unnecessary circularity. The symbolic representations could obfuscate the precision needed. The findings of Mielke et al. (2019) https://arxiv.org/abs/1906.04726 and Wan (2022) https://openreview.net/forum?id=-llS6TiOew were a painful/bitter lesson to many. I'd hate to see another generation of students being misled."
(To Ed: I suspect that you are already familiar with these works. So I wonder what "symbolic approaches" refer to in your case, and whether they are being applied as a post-processing (e.g. post ML) strategy. If so, and if these are based on "grammar" etc., please be careful as it is not necessary for processing. One can post-edit ML-outputted texts according to some stylistic preferences as part of post-processing heuristics --- but I have concerns as to how much the employment of such heuristics could get abused. As you may know, many CL/NLPers might already be too "hooked" on grammar and textual representations. There are many dependencies to grammar teaching, so ethical concerns in pedagogy of such need to be considered.)
v. Re Sandra's keynote: Please see literature mentioned in [iv] above. Re "[u]sing transformers for hate speech detection tends to give good results ...": how do these results and domain effects reconcile with data statistics, even if/when one does not segment text into "words" or "sentences" (or any categories that grammarians like and many CL/NLPers used to be "addicted" to)? It might be a higher bar, but there is work to do to see where such correspondences (between language phenomena and data statistics, for example) exist and if they do! (And if not, negative results are also results! And with any data processing/interpretation, it's information, not "meaning", that matters.)
vi. Re Efstathios' keynote: Please see notes above.
As language sciences (e.g. Linguistics) and NLP are still taught at some universities, i.e. part of publicly accessible education, there is a general responsibility that one should bear when promoting/hosting events that would be explicitly/implicitly supporting biases and/or in violation of scientific integrity.
Thank you for reading, for tolerating my rant here. There has been some "bad research" (and some miseducation) in the area of CL/NLP. Hence, I thought to send a reminder (and call for action/correction).
Thanks and best
Ada
[1] "origin": (mother's womb? [Jest... but yes and no.]) How current is this analysis from a globalized perspective? Do people categorically use language in one way or another based on some "types" related to "origin" (whatever that refers to), or more based on contexts and/or habits (personal and/or group-based, if the latter, what "group identity" is assumed in the data and in the experiment)?
(In reply to amalhaddad--- via Corpora <corpora@list.elra.info>, Thu, Aug 17, 2023 at 11:17 AM)
_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-leave@list.elra.info
Hi Ada,
It's understandable that enthusiasm can sometimes lead to excessive engagement, but your disruptive posting on the mailing list has reached an intolerable level. Please keep your conversations private instead of spamming everyone and curb your enthusiasm. Your obnoxious behavior reflects poorly on you.
Thanks.
Fully agree with you Ben.
Rodolfo
(In reply to Ben Sir via Corpora corpora@list.elra.info, Mon, Aug 21, 2023, 01:00)
Can’t agree more.
Toms
(In reply to Rodolfo Delmonte via Corpora corpora@list.elra.info; sent Monday, August 21, 2023 10:06 AM; to Ben Sir benoit.siroit@gmail.com; cc corpora corpora@list.elra.info; subject: [Corpora-List] Re: RANLP 2023 Call for Participation)
Dear Ben, Rodolfo, and Toms
Please accept that there is a responsibility to science, technology, engineering, and education (or anything that we undertake).
If you could point out the specific arguments as to which of what I wrote may be problematic to you, perhaps we can have a constructive exchange. The way in which you three expressed your sentiments on this thread can be interpreted as mobbing.
Please note the intent behind my statement and give me the benefit of the doubt as to why I would have invested my time and energy to write the reply that I did to the list: "As language sciences (e.g. Linguistics) and NLP are still taught at some universities, i.e. part of publicly accessible education, there is a general responsibility that one should bear when promoting/hosting events that would be explicitly/implicitly supporting biases and/or in violation of scientific integrity." This applies to the whole area of computing, including digital humanities and the computational social sciences. *In short, there are no symbolic concepts relevant in computing / computational processing.* I am sorry if that has not been clear.
I understand that there are members in the CL/NLP community/communities who might be interested in (or used/addicted to) "word" hacking. But it is now high time to stop.
@Ben: Please note that I am not doing this "for fun". I am not trying to ridicule anyone. My remarks are not ad personam. For each of the research directions/practices that I commented on, there are opportunities for all practitioners to do a better job, to refine our analyses.
Thanks and best
Ada
(In reply to Toms Bergmanis via Corpora <corpora@list.elra.info>, Mon, Aug 21, 2023 at 9:45 AM)
Amendment: In short, there are no symbolic concepts relevant in computing / computational processing except for those which also align with statistics. (There are various levels of assumptions/abstractions that could be relevant depending on the goals/tasks. But much of what one might have been doing in "symbolic computing" surely deserves a critical re-examination.)
(In reply to Ada Wan adawan919@gmail.com, Mon, Aug 21, 2023 at 4:48 PM)
Dear all on the Corpora-List
I understand it is possible that some of you may harbor some negative sentiments towards me and/or my recent replies on the list. That having been said, I would like to remind everyone on this list that it is important to understand that many subjects such as computational [x, where x can be e.g. linguistics, biology, physics, modeling...], digital humanities, data analytics, data science, and many of their dependencies have been / are in the public domain, much of which is academic and scientific in nature. Science is in the public domain.
What we are experiencing here is sort of a computational and statistical turn in the computational sciences and studies --- anything that involves data (computational and otherwise). Previously (or even currently in many disciplines/practices), one has modeled / has been modeling many symbolic concepts and values computationally, directly inheriting these from "traditional sciences" (i.e. sciences from a time when all was done without any computational machinery), assuming that these values and the relationship between such would not only hold but also hold as the only ground truth. But as e.g. my results have shown, many of these scientific concepts, values, and relationships deserve to be re-evaluated and re-interpreted.
What I have been trying to do is to communicate this, as without any updates and/or self-correction, we could be experiencing many discrepancies in our experimental results. Good scientific practice (including good assumptions therefor) is fundamental to everyone. This includes but is not limited to having good assumptions, leveraging appropriate methods, being responsible in evaluation as well as addressing ethical concerns, e.g. in the case of my findings: a combination of false assumptions and miseducation. (Sorry to re-iterate this but it is just such an important lesson for many on this list... it may be painful for some too.)
Corpora-list might have changed more or less like how the field of CL/NLP has in the past decades. While these areas might have become more generalized and thus the audience more "diverse" in terms of background and areas of familiarity, there are certainly some on this list who are concerned about some of the "bad" science/values that could get propagated through the use of data/corpora. That is one of the reasons behind my many replies of late.
*If you should find my comments/replies an issue of concern, please let me know what specifically you disagree with. I'd be happy to modify my formulations or discuss further. If you think I have been wrong somewhere, please do let me know. I'd be happy to update.*
Thanks and best
Ada
(In reply to Ada Wan adawan919@gmail.com, Mon, Aug 21, 2023 at 5:39 PM)
Dear all,
I was shocked to see a vitriolic ad-hominem attack on a colleague posted to this mailing list. It is entirely inappropriate to post this type of diatribe against an individual even though someone might disagree with either the tone or the content of an individual's messages or arguments. The fact that other members of the community chimed in to reinforce the attack is also appalling and entirely inappropriate.
Sincerely,
Gully Burns
(In reply to Ada Wan via Corpora corpora@list.elra.info, Tue, Aug 22, 2023 at 1:23 PM)
Dear All,
I would like to warmly suggest/remind the following to all of us (as a friendly suggestion, on which I will not follow up):
* One can find good examples online of the *"netiquette"* of mailing lists to reduce problems (see here https://www.snort.org/faq/what-is-the-mailing-list-etiquette, here https://en.opensuse.org/openSUSE:Mailing_list_netiquette and here https://sites.ualberta.ca/~pletendr/list-net.html for examples, which can be useful for all of us).
* Please, let us always remember that there is a *person with feelings* on the other side of a communication. We need to gently and respectfully handle cases where we have objections.
* If you feel that a conversation grows too big or is somehow problematic, address a *personal e-mail to a main contributor, nicely suggesting an alternative* you consider more appropriate. If this fails systematically, then scale it up through a list moderator (or the list itself) politely.
* Specific *suggestions for appropriate digital spaces* that can hold e.g. long discussions may allow all such discussions to find their own nest after a given point, so that we all have a common additional resource connected to the list, for topics that do need the added interaction.
* If you feel that a topic you contribute to really ignites interesting conversation, or if you simply receive an e-mail suggesting that you move a long conversation elsewhere due to its size, *consider an alternative* (or even ask the list for one), to facilitate the use of the mailing list itself.
* Let us remember that what is *uninteresting to us may be interesting to others*.
As a final comment, before best practices come *common understanding* and *good will*. Let us primarily build on these, as we have done in this list for many years.
Having said the above, I would like to thank Ada (and all the others) for the contributions (past, current and future) and discussions that keep this list alive.
Best regards,
George G.
P.S. I would also like to thank Gully for trying to keep the list humane.
(In reply to Gully Burns via Corpora, 23/8/23 00:53)
_______________________________________________ Corpora mailing list -- corpora@list.elra.info https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/ To unsubscribe send an email to corpora-leave@list.elra.info Nota automatica aggiunta dal sistema di posta *Sostieni il futuro* Dona il tuo 5x1000 al Collegio Internazionale Ca' Foscari *FINANZIAMENTO DELLA RICERCA SCIENTIFICA E DELLA UNIVERSITÀ | CODICE FISCALE: 80007720271* _______________________________________________ Corpora mailing list -- corpora@list.elra.info https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/ To unsubscribe send an email to corpora-leave@list.elra.info _______________________________________________ Corpora mailing list -- corpora@list.elra.info https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/ To unsubscribe send an email to corpora-leave@list.elra.info
Corpora mailing list -- corpora@list.elra.info https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/ To unsubscribe send an email to corpora-leave@list.elra.info
Dear all
Thank you for all your feedback.
As George wrote in his reply: "Please, let us always remember that there is a *person with feelings* on the other side of a communication. We need to gently and respectfully handle cases where we have objections." I cannot agree more.
There is a certain degree of empathy that one needs to exercise in reading, writing, and research (even technical research), esp. if one has only been educated in one discipline. If one does not understand why other disciplines might have different assumptions and developmental histories or (perceived) narratives, it is best to check/verify that first before "attacking" others or their arguments. Interdisciplinary/transdisciplinary work is also difficult for that reason (e.g. in translating/addressing/aligning assumptions/expectations). As Jonas noted, "(And everyone on the list -- I am sending this to the whole list in case I am wrong about some things, so others can add their thoughts)." --- I agree! This practice can also lead to better transdisciplinary understanding and exchange. It can also be hard to believe our research/tech space has come to this, but if you'd allow me to explain ---
First of all, I think most of you might know me from my public rebuttals for my ICLR2021 & 2022 submissions. For the latter, in which I decomposed "words" more explicitly, I had to really "fight" hard to convince the reviewers. That also has to do with the fact that the concept of "word" (and also "morphology") and the decades-long assumption and adoption of these in CL/NLP/Linguistics might have been too casual/imprecise/negligent of a choice and practice. As my work has shown, a mistake therein was / might have been made. Some students have been miseducated --- myself partly included, but since CL/NLP/Linguistics were not the only subject(s) that I have studied, it might have been easier for me to abandon these assumptions, but for many others, this may not have been the case. If we continue with these practices in the research space through conferences or research activities, such malpractice would be exacerbated.
Textual data can be *processed* without word tokenization or sentence segmentation [1]. One can process data in full --- in character/byte representations (depending on the task and computational resources: e.g. for pattern matching on strings one would work with characters, for other tasks with bytes). Depending on the nature of the tasks and methods, our *evaluation and interpretation* strategies may differ. Computational neural network models are statistical models and need to be evaluated and interpreted statistically --- this is the perspective of many computer scientists and statisticians and it is correct. In the tradition of CL/NLP/Linguistics (or even among many in data analytics or in digital humanities), there has been an erroneous assumption and practice that one could evaluate statistical models based on textual output only.
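To make the character/byte distinction concrete, here is a minimal sketch in Python. The sample string and the helper name `char_ngrams` are assumptions made for illustration only; they do not come from the work cited in this thread.

```python
# Illustrative sample text (an assumption, not data from the thread).
text = "Varna 2023: na\u00efve caf\u00e9"

# Character view: one element per Unicode code point.
chars = list(text)

# Byte view: the UTF-8 encoding of the same text.
data = text.encode("utf-8")

def char_ngrams(s, n):
    """Return all overlapping n-grams of s; no word or sentence boundaries are assumed."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

# The two views differ in length whenever the text contains non-ASCII characters.
print(len(chars), len(data))
print(char_ngrams(text, 3)[:4])
```

The same sliding window works on `data` as well, since slicing a `bytes` object yields `bytes`; which view is appropriate depends on the task, as noted above.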
As with areas related to "language" outside the context of computing, e.g. Linguistics (without the use of computational tools), there are certain structural assumptions (from the past decades) that need to be refined. I have been trying to advocate the broadening of one's perspectives/interpretations of "language" to ones that are without "words", "sentences", "linguistic structure(s)", "grammar", and "p-language(s)". These concepts denote nothing universal (or determinate --- not without circularity) and the amplification of these through technology/computing can lead to unethical/unhealthy consequences. I have the impression that our understanding of this (whether one is a linguist, CL/NLPer, computing professional or AI practitioner) may not be aligned.
As many disciplines/sectors are now leveraging similar/same methods, I feel that there is a responsibility to clarify this.
Last but not least, please note that I have only been *responsive* to announcements with potential concerns (e.g. those scientific or ethical in nature). I did/do not proactively advertise my own work, nor do I intend to do so on this mailing list just for fun or to offend others.
As always, I remain open for your feedback.
Thank you for your attention.
Best regards Ada
[1] @Jonas: re "sentences": i. "sentence" is not a universal concept crosslinguistically or cross-stylistically (e.g. across genres) or across modalities (speech/signing does not occur in form of "sentences", esp. natural speech/signing); ii. even if "sentence" were defined "x-centrically" (if definable at all), where x denotes a certain style, for example; stylistic hegemony would occur, not to mention that overfitting to any one style is likely to lead to bad generalizations; iii. re "I don't think conference organizers usually make hard prescriptions on what constitutes a sentence" --- that is a problem, isn't it? There is no standardization possible either. "Sentences" are also indeterminate, esp. in the context of computing. We wouldn't want to encourage "sentence"-hacking, would we? iv. in many NLP toolkits, "sentence" often refers to "line" (as delimited by linebreaks), v. for those who have worked on data collection and curation before, esp. for parallel data, content is often aligned by line (and that can already be difficult). Thanks for your content-rich comments, btw!
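Points iii and iv above can be illustrated with a small sketch in Python: two common conventions for "sentence" give different segmentations of the same text. The sample string and the punctuation rule are assumptions chosen for illustration; neither is taken from any particular toolkit.

```python
import re

# Hypothetical sample: two lines, the first containing abbreviations.
sample = "Dr. Smith arrived at 3 p.m. and left\nNo punctuation here"

# Convention 1: a "sentence" is a line, as delimited by line breaks.
by_line = sample.splitlines()

# Convention 2: a naive heuristic that splits after ., ! or ? followed by whitespace.
by_punct = re.split(r"(?<=[.!?])\s+", sample)

# The two conventions disagree on both the number and the content of the segments.
print(len(by_line), by_line)
print(len(by_punct), by_punct)
```

Neither output is "the" correct segmentation; the sketch only shows that the unit depends on the rule chosen.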
On Fri, Aug 25, 2023 at 12:59 PM George Giannakopoulos < ggianna@iit.demokritos.gr> wrote:
Dear All,
I would like to warmly suggest/remind the following to all of us (as a friendly suggestion, on which I will not follow up):
- One can find good examples of mailing-list *"netiquette"* online to reduce problems (see here https://www.snort.org/faq/what-is-the-mailing-list-etiquette, here https://en.opensuse.org/openSUSE:Mailing_list_netiquette and here https://sites.ualberta.ca/~pletendr/list-net.html for examples, which can be useful for all of us).
- Please, let us always remember that there is a *person with feelings* on the other side of a communication. We need to gently and respectfully handle cases where we have objections.
- If you feel that a conversation grows too big or is somehow problematic, address a *personal e-mail to a main contributor suggesting nicely an alternative* you consider more appropriate. If this fails systematically, then scale it up through a list moderator (or the list itself) politely.
- Specific *suggestions for appropriate digital spaces* that can hold e.g. long discussions may allow all such discussions to find their own nest after a given point, so that we all have a common additional resource connected to the list, for topics that do need the added interaction.
- If you feel that a topic you contribute to really ignites interesting conversation, or if you simply receive an e-mail suggesting that you move a long conversation elsewhere due to its size, *consider an alternative* (or even ask the list for one), to facilitate the use of the mailing list itself.
- Let us remember that what is *uninteresting to us may be interesting to others*.
As a final comment, before best practices come *common understanding* and *good will*. Let us primarily build on these, as we have done in this list for many years.
Having said the above, I would like to thank Ada (and all the others) for the contributions (past, current and future) and discussions that keep this list alive.
Best regards, George G.
P.S. I would also like to thank Gully for trying to keep the list humane.
On 23/8/23 00:53, Gully Burns via Corpora wrote:
Dear all,
I was shocked to see a vitriolic ad-hominem attack on a colleague posted to this mailing list. It is entirely inappropriate to post this type of diatribe against an individual even though someone might disagree with either the tone or the content of an individual's messages or arguments. The fact that other members of the community chimed in to reinforce the attack is also appalling and entirely inappropriate.
Sincerely,
Gully Burns
On Tue, Aug 22, 2023 at 1:23 PM Ada Wan via Corpora < corpora@list.elra.info> wrote:
Dear all on the Corpora-List
I understand it is possible that some of you may harbor some negative sentiments towards me and/or my recent replies on the list. That having been expressed, I would like to remind everyone on this list it is important to understand that many subjects such as computational [x, where x can be e.g. linguistics, biology, physics, modeling...], digital humanities, data analytics, data science, and many of their dependencies have been / are in the public domain, much of which academic and scientific in nature. Science is in the public domain.
What we are experiencing here is sort of a computational and statistical turn in the computational sciences and studies --- anything that involves data (computational and otherwise). Previously (or even currently in many disciplines/practices), one has modeled / has been modeling many symbolic concepts and values computationally, directly inheriting these from "traditional sciences" (i.e. sciences from a time when all was done without any computational machinery), assuming that these values and the relationship between such would not only hold but also hold as the only ground truth. But as e.g. my results have shown, many of these scientific concepts, values, and relationships deserve to be re-evaluated and re-interpreted.
What I have been trying to do is to communicate this, as without any updates and/or self-correction, we could be experiencing many discrepancies in our experimental results. Good scientific practice (including good assumptions therefor) is fundamental to everyone. This includes but is not limited to having good assumptions, leveraging appropriate methods, being responsible in evaluation as well as addressing ethical concerns, e.g. in the case of my findings: a combination of false assumptions and miseducation. (Sorry to re-iterate this but it is just such an important lesson for many on this list... it may be painful for some too.)
Corpora-list might have changed more or less like how the field of CL/NLP has in the past decades. While these areas might have become more generalized and thus the audience more "diverse" in terms of background and areas of familiarity, there are certainly some on this list who are concerned about some of the "bad" science/values that could get propagated through the use of data/corpora. That is one of the reasons behind my many replies of late.
*If you should find my comments/replies an issue of concern, please let me know what in specifics you disagree with. I'd be happy to modify my formulations or discuss further. If you think I have been wrong somewhere, please do let me know. I'd be happy to update.*
Thanks and best Ada
On Mon, Aug 21, 2023 at 5:39 PM Ada Wan adawan919@gmail.com wrote:
Amendment: In short, there are no symbolic concepts relevant in computing / computational processing except for those which also align with statistics. (There are various levels of assumptions/abstractions that could be relevant depending on the goals/tasks. But much of what one might have been doing in "symbolic computing" surely deserves a critical re-examination.)
On Mon, Aug 21, 2023 at 4:48 PM Ada Wan adawan919@gmail.com wrote:
Dear Ben, Rodolfo, and Toms
Please accept that there is a responsibility to science, technology, engineering, and education (or anything that we undertake).
If you could point out the specific arguments as to which of what I wrote may be problematic to you, perhaps we can have a constructive exchange. The way in which you three expressed your sentiments on this thread can be interpreted as mobbing.
Please note the intent behind my statement and give me the benefit of the doubt as to why I would have invested my time and energy to write the reply that I did to the list: "As language sciences (e.g. Linguistics) and NLP are still taught at some universities, i.e. part of publicly accessible education, there is a general responsibility that one should bear when promoting/hosting events that would be explicitly/implicitly supporting biases and/or in violation of scientific integrity." This applies to the whole area of computing, including digital humanities and the computational social sciences. *In short, there are no symbolic concepts relevant in computing / computational processing.* I am sorry if that has not been clear.
I understand that there are members in the CL/NLP community/communities who might be interested in (or used/addicted to) "word" hacking. But it is now high time to stop.
@Ben: Please note that I am not doing this "for fun". I am not trying to ridicule anyone. My remarks are not ad personam. For each of the research directions/practices that I commented on, there are opportunities for all practitioners to do a better job, to refine our analyses.
Thanks and best Ada
On Mon, Aug 21, 2023 at 9:45 AM Toms Bergmanis via Corpora < corpora@list.elra.info> wrote:
Can’t agree more.
Toms
*From:* Rodolfo Delmonte via Corpora corpora@list.elra.info
*Sent:* Monday, August 21, 2023 10:06 AM
*To:* Ben Sir benoit.siroit@gmail.com
*Cc:* corpora corpora@list.elra.info
*Subject:* [Corpora-List] Re: RANLP 2023 Call for Participation
Fully agree with you Ben.
Rodolfo
On Mon, 21 Aug 2023 at 01:00, Ben Sir via Corpora corpora@list.elra.info wrote:
Hi Ada,
It's understandable that enthusiasm can sometimes lead to excessive engagement, but your disruptive posting on the mailing list has reached an intolerable level. Please keep your conversations private instead of spamming everyone and curb your enthusiasm. Your obnoxious behavior reflects poorly on you.
Thanks.
--
*George Giannakopoulos, PhD*
*Researcher* Home page http://www.iit.demokritos.gr/~ggianna SKEL Lab - NCSR Demokritos http://www.iit.demokritos.gr and
*Scientific Officer* ahedd DIH - NCSR "Demokritos" https://ahedd.demokritos.gr and
*Co-founder, Chief Executive Officer* SciFY Not-for-Profit Company http://www.scify.org
Dear Ada, dear all,
I am not a linguist but a computational scientist who is quite used to talking with (and trying to understand) linguists. I must say that I usually read your mails as thoroughly as my schedule and patience allow me to, but, to be honest, I also have a rather negative feeling when reading your “discourse”.
In this discourse, I see facts + interpretation + rhetoric.
[Here I take the risk of caricaturing for the sake of brevity. I hope you will understand that I have neither the time nor the intention to go deeply into all the intricacies of your different claims, as I am more a witness than an actor in this scientific dispute.]
My understanding of your facts: Neural models do not use the concept of word in any of their tasks, but achieve very interesting results in their modelling of the language.
My understanding of your interpretation: this is the proof that there is no such thing as a word.
My understanding of your rhetoric: linguists are still using “words”, so they are wrong or dishonest or miseducated or dumb; we should wipe out entirely any occurrence of this concept and start over with another modelling of the language.
Please understand that I am just presenting the way I interpret your different messages. And even if I am wrong here, this interpretation is to be taken into account, as we are all persons with feelings. This feeling is a fact, even if I do not particularly feel targeted by your different criticisms. I hope this will help you ponder the terms you use in your next messages.
This being said, I was not particularly surprised to see some “passionate” replies to your different messages. And I agree with everyone here that we should not go into such passion and use ad-hominem attacks on a mailing list, AND you should also understand that most of your rhetoric does contain such passion and attacks.
Concerning the facts:
You are right: neural models do not use any notion of word (or word morphology) as it is usually thought of in linguistics, since a model usually first decides the granularity at which it will aggregate its input (a sequence of characters) into tokens, to each of which it attaches an “interpretation” (modelled as a multi-dimensional vector).
Concerning the interpretation:
1. You want to wipe out the notion of word based on such a fact. I would agree somehow if we were dealing with a universal modelling of language, but this is not the case. Humans model language in a certain way and neural models in another way (even if neural networks are claimed to be inspired by the biological neurones in our brains). The fact that a concept does not exist in one model does not entail that it does not exist in another model.
2. Also, you make the very same mistake in the way you look at the facts: i.e. there is no such thing as a character…, which means that the input of a NN is already flawed by a bias in the way we look at language. Indeed, characters are a very recent invention that builds on different concerns:
   - the usual graphical elements that are traditionally used in writing a language and that have been interpreted as atomic,
   - their interpretation by the encoding authorities (see the differences and debates about code points vs characters),
   - arbitrary decisions (e.g. why model A and a as 2 different characters?).
   Moreover, corpora are usually badly encoded, with one character used for another (a quote instead of an apostrophe, a non-breaking space instead of a space, …), and this only accounts for languages with a writing system or transcription, i.e. not the majority of them.
   The conclusion is that even neural networks carry an artificial bias in the way they model language, which means that the conclusions we draw from them are as flawed as the ones we draw from the classical way linguists look at languages.
3. Most serious linguists never defined “words” lightly, and most of them know that this concept is an “approximation” of something that is very difficult to apprehend and seems to be grounded more in linguistics from human introspection than in linguistics from corpora. It somehow represents the way our human brain aggregates the atoms of the language (characters/phonemes) into something to which we associate an interpretation. In this sense, words are somehow the “tokens” of our biological neural network (and certainly far more).
As the production of an utterance is not a bijection between whatever we have in our head and the sequential signal we use to communicate, I agree with you that “words” are certainly not present in a corpus (but I do think that our inner “tokens” may be observed somehow there).
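The encoding points in item 2 above (case distinctions, look-alike characters, code points vs characters) can be inspected directly. A minimal sketch in Python, using only the standard `unicodedata` module; the sample characters are chosen for illustration and are not taken from any corpus discussed here:

```python
import unicodedata

# Case: "A" and "a" are modelled as two distinct code points.
print(hex(ord("A")), hex(ord("a")))

# Look-alikes that are often substituted for one another in corpora.
for ch in ["'", "\u2019", " ", "\u00a0"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Code points vs characters: one user-perceived character, two encodings.
composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # letter e followed by a combining acute accent

# The strings differ as code point sequences but agree after NFC normalization.
print(composed == decomposed, len(composed), len(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Whether such variants should be merged before modelling is itself a design decision, which is the point made in item 2: the character inventory is not a neutral input.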
Concerning the rhetoric:
I do not think any linguist or computational linguist is naive enough to think that any of the modelling we deal with is a “truth”, and I doubt any of them is miseducated enough to think that “words” are clearly defined and undoubtedly present in corpora. I do think, though, that they are usually right to observe occurrences (or hints) of non-atomic constructs that we associate with some interpretation. I also think that this way of looking at a corpus has some advantages that are not really present in NNs (for instance, it can reveal regularities that help humans produce new utterances without being shown a large number of examples).
I also think that even if you were totally right in your facts and interpretations, asking for a denial of current/past ways of looking at texts would be a mistake. Even in physics, since the general theory of relativity, we have known that classical mechanics is wrong; however, it is still in use, and this is not a problem as long as everybody knows under which hypotheses it is a good enough approximation and under which hypotheses it no longer works.
I know this message will certainly not make you think differently, but if it allows you to communicate differently with persons who still use the terms “words” or “sentences” as a simple shortcut to position their work within a shared/common understanding of the state of the art, in contexts where there is no room for better explanation (e.g. in summaries of their keynote speech), then I will have achieved something.
Hoping this scientific debate will continue in a calmer manner,
Regards,
Gilles Sérasset,
Dear Ada, dear all,
I'm a bit concerned with what has been going on with the list recently. The list, as far as I understand, serves several purposes. One of them is purely informative, where one informs the community about potentially interesting jobs, conferences etc. If I open an answer to a job advertisement, I expect it to be a question useful for the potential applicants or a correction about, for example, deadlines.
Another purpose is to ask questions or start discussions on various topics, either theoretical or purely practical. There I expect people to share their experience and opinions.
What I do not find OK is giving feedback on purely informational posts in the way Ada does. In my opinion, discussions of whether words or sentences are up-to-date concepts in any (general) linguistic or computational linguistic framework should be held in separate threads. (Notice also that the problem of text segmentation has been a topic for a long time already.) I wouldn't mind if Ada's comments were presented privately to the authors of the posts, or discussed in separate list mails. Otherwise, we are facing chaos here.
Summing up, I would be more than happy to participate if discussions about the relation between linguistics and NLP took place, but not mixed with advertisements.
I hope I did not offend anybody with this message.
Best,
Edyta Jurkiewicz-Rohrbacher
On Wed, 30 Aug 2023 at 16:35, Gilles Sérasset via Corpora corpora@list.elra.info wrote:
Dear Ada, dear all,
I am not a linguist but a computational scientist which is quite used to talk with (and tries to understand) linguists. I must say that I usually read your mails as thoroughly as my schedule and patience allows me to, but, to be honest, I also have a rather negative feeling when reading your “discourse”.
In this discourse, I see facts + interpretation + rhetorics.
[Here I take the risk of caricaturing for the sake of shortness, I hope you will understand that I have no time nor intention to really go deeply in all the intricacies of your different claims as I am more a witness than an actor of this scientific dispute]
My understanding of your facts: Neural models do not use the concept of word in any of their tasks, but achieve very interesting results in their modelling of the language.
My understanding of your interpretation: this is the proof that there is no such thing as a word.
My understanding of your rhetoric: linguists are still using “words”, so they are wrong or dishonest or miseducated or dumb, we should wipe out entirely any occurence of this concept and start over with another modelling of the language.
Please, understand that I am just presenting the way I am interpreting your different messages. And even if I am wrong here, this interpretation is to be taken into account as we are all persons with feeling. This feeling is a fact, even if I do not particularly feel targeted by your different criticisms. I hope this will help you ponder the terms involved in your next messages.
This being said, I was not particularly surprised to see some “passionate” replies to your different messages. And I agree with everyone here, we should not go into such passion and use ad-hominem attacks on a mailing list, AND you should also understand that most of your rhetoric do contains such passion and attacks.
Concerning the facts :
You are right, Neural models does not use any notion of word (or word morphology) as it is usually thought in linguistics as it usually first decide what is the granularity with which it will aggregate its input (sequence of characters) into tokens to which it attaches an “interpretation” (modelled as a multi-dimensional vector).
Concerning the interpretation :
You want to wipe out the notion of word based on such a fact. I would agree somehow if we were dealing with a universal modelling of language, but this is not the case. Human model language in a certain way and neural models in another way (even if neural networks are claimed to be inspired by biological neurones in our brains). The fact that a concept does not exist in a model does not entail that it does not exist in another model.
Also, you do make the very same mistake concerning the way you look at the facts: i.e. there is no such thing as a character…, which means that the input of NN is already flown with a bias with which we look at language. Indeed characters are a very recent invention that builds on different concerns:
- usual graphical elements that are traditionally used in language writing and that has been interpreted as atomic,
- their interpretation by the encoding authorities (see the differences and debates about code points vs characters)
- arbitrary decision made (e.g. why model A and a as 2 different characters?)
Moreover, all corpora are usually badly encoded by using one character for another (quote instead of apostrophe, unbreakable character instead of a space, …) and this only accounts for languages with a writing system or transcription, i.e. not the majority of them.
The conclusion is that even Neural Network uses artificial bias in the way they model language, which means that the conclusion we draw from them are as flawed as the one we draw from the classical way linguists look at languages.
- Most serious linguists never defined “words” lightly and most of them know that this concept is an "approximation” of something that is very difficult to apprehend and seems to be more grounded into linguistics from human introspection than linguistics from corpora. It somehow represents the way our human brain aggregates the atoms of the language (characters/phonemes) into something to which we associate an interpretation. In this sense, it is somehow the “tokens” of our biological neural network (and certainly far more).
As an utterance production is not a bijection between whatever we have in our head and the sequential signal we use to communicate, I agree with you on the fact that “words" are certainly not present in a corpus (but I do think that our inner “tokens” may be observed somehow there).
Concerning the rhetoric:
I do not think any linguist or computational linguist is naive enough to think that any of the models we deal with is a “truth”, and I doubt any of them is miseducated enough to think that “words” are clearly defined and undoubtedly present in corpora. I do think, though, that they are usually right to observe occurrences (or hints) of non-atomic constructs that we associate with some interpretation. I also think that this way of looking at a corpus has some advantages that are not really present in neural networks (for instance, it can reveal regularities that help humans produce new utterances without being shown a large number of examples).
I also think that, even if you were totally right in your facts and interpretations, asking for a denial of current and past ways of looking at texts would be a mistake. Even in physics, since the general theory of relativity, we have known that classical mechanics is wrong; however, it is still in use, and this is not a problem as long as everybody knows under which hypotheses it is a good enough approximation and under which it no longer works.
I know this message will certainly not make you think differently, but if it allows you to communicate differently with people who still use the terms “words” or “sentences” as a simple shortcut to position their work within a shared understanding of the state of the art, in contexts where there is no room for a fuller explanation (e.g. in summaries of their keynote speeches), then I will have achieved something.
Hoping this scientific debate will continue in a calmer manner,
Regards,
Gilles Sérasset,
Corpora mailing list -- corpora@list.elra.info https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/ To unsubscribe send an email to corpora-leave@list.elra.info
Dear All,
I agree with Edyta's polite remarks.
I find the discussions appended to purely informative posts quite confusing, and I am "losing track" of the original posts to the point that I fear I might miss calls that could be relevant for my work, or miss discussions that are worth joining. Before Edyta's remarks I was even considering leaving the list because of the current situation on it.
So, I join Edyta's kind request to keep discussions in separate threads and leave calls for papers/abstracts or job announcements as purely informative posts. Perhaps opening a new, separate discussion thread would be an alternative that allows us to filter the different kinds of communications we receive from the list.
Best wishes to everyone, Daniela Cesiri
On Wed, Aug 30, 2023, 17:15 Edyta Jurkiewicz-Rohrbacher via Corpora <corpora@list.elra.info> wrote:
Dear Ada, dear all, I'm a bit concerned about what has been going on with the list recently. The list, as far as I understand, serves several purposes. One of them is purely informative: informing the community about potentially interesting jobs, conferences, etc. If I open a reply to a job advertisement, I expect it to be a question useful to potential applicants or a correction about, for example, deadlines.
Another purpose is to ask questions or start discussions on various topics, either theoretical or purely practical. There I would expect people to share their experience and opinions.
What I do not find OK is giving feedback on purely informational posts in the way Ada does. In my opinion, discussions of whether words or sentences are up-to-date concepts in any (general) linguistic or computational linguistic framework should be held in separate threads. (Note also that the problem of text segmentation has been a topic for a long time already.) I wouldn't mind if Ada's comments were presented privately to the authors of the posts, or discussed in separate list mails. Otherwise, we are facing chaos here.
Summing up, I would be more than happy to participate if discussions about the relation between linguistics and NLP took place, but not mixed with advertisements.
I hope I did not offend anybody with this message. Best, Edyta Jurkiewicz-Rohrbacher
Dear Ada, I have no negative feelings towards your opinions and the way you communicate them!
Together with the shift of NLP towards number crunching, we are also experiencing the de-academization of this scientific field.
In the past we didn't have the industry players, so tweaking POS taggers and parsers to squeeze out an extra 0.5% was considered a scientific achievement. And it is, let us be frank with ourselves: we spent a lot of intellectual energy on these seemingly small achievements, but along the way we discovered new methods and explored new scientific ideas.
But OpenAI showed us another perspective, and let's spit it out: when a genuine idea about language representation is boosted by the computing power of modern-day CPUs and fast memory chips, breakthroughs do happen. And this leads us to something qualitatively new.
The question is how these two worlds will merge.
Hristo Tanev, Researcher and scientific project officer Joint Research Centre, EC
On Tue, Aug 22, 2023, 22:23 Ada Wan via Corpora corpora@list.elra.info wrote:
Dear all on the Corpora-List
I understand it is possible that some of you may harbor some negative sentiments towards me and/or my recent replies on the list. That said, I would like to remind everyone on this list that many subjects such as computational [x, where x can be e.g. linguistics, biology, physics, modeling...], digital humanities, data analytics, data science, and many of their dependencies have been / are in the public domain, and much of this is academic and scientific in nature. Science is in the public domain.
What we are experiencing here is sort of a computational and statistical turn in the computational sciences and studies --- anything that involves data (computational and otherwise). Previously (or even currently in many disciplines/practices), one has modeled / has been modeling many symbolic concepts and values computationally, directly inheriting these from "traditional sciences" (i.e. sciences from a time when all was done without any computational machinery), assuming that these values and the relationship between such would not only hold but also hold as the only ground truth. But as e.g. my results have shown, many of these scientific concepts, values, and relationships deserve to be re-evaluated and re-interpreted.
What I have been trying to do is to communicate this, as without any updates and/or self-correction, we could be experiencing many discrepancies in our experimental results. Good scientific practice (including good assumptions therefor) is fundamental to everyone. This includes but is not limited to having good assumptions, leveraging appropriate methods, being responsible in evaluation as well as addressing ethical concerns, e.g. in the case of my findings: a combination of false assumptions and miseducation. (Sorry to re-iterate this but it is just such an important lesson for many on this list... it may be painful for some too.)
Corpora-list might have changed more or less like how the field of CL/NLP has in the past decades. While these areas might have become more generalized and thus the audience more "diverse" in terms of background and areas of familiarity, there are certainly some on this list who are concerned about some of the "bad" science/values that could get propagated through the use of data/corpora. That is one of the reasons behind my many replies of late.
*If you should find my comments/replies an issue of concern, please let me know what specifically you disagree with. I'd be happy to modify my formulations or discuss further. If you think I have been wrong somewhere, please do let me know. I'd be happy to update.*
Thanks and best Ada
On Mon, Aug 21, 2023 at 5:39 PM Ada Wan adawan919@gmail.com wrote:
Amendment: In short, there are no symbolic concepts relevant in computing / computational processing except for those which also align with statistics. (There are various levels of assumptions/abstractions that could be relevant depending on the goals/tasks. But much of what one might have been doing in "symbolic computing" surely deserves a critical re-examination.)
On Mon, Aug 21, 2023 at 4:48 PM Ada Wan adawan919@gmail.com wrote:
Dear Ben, Rodolfo, and Toms
Please accept that there is a responsibility to science, technology, engineering, and education (or anything that we undertake).
If you could point out the specific arguments as to which of what I wrote may be problematic to you, perhaps we can have a constructive exchange. The way in which you three expressed your sentiments on this thread can be interpreted as mobbing.
Please note the intent behind my statement and give me the benefit of the doubt as to why I would have invested my time and energy to write the reply that I did to the list: "As language sciences (e.g. Linguistics) and NLP are still taught at some universities, i.e. part of publicly accessible education, there is a general responsibility that one should bear when promoting/hosting events that would be explicitly/implicitly supporting biases and/or in violation of scientific integrity." This applies to the whole area of computing, including digital humanities and the computational social sciences. *In short, there are no symbolic concepts relevant in computing / computational processing.* I am sorry if that has not been clear.
I understand that there are members in the CL/NLP community/communities who might be interested in (or used/addicted to) "word" hacking. But it is now high time to stop.
@Ben: Please note that I am not doing this "for fun". I am not trying to ridicule anyone. My remarks are not ad personam. For each of the research directions/practices that I commented on, there are opportunities for all practitioners to do a better job, to refine our analyses.
Thanks and best Ada
On Mon, Aug 21, 2023 at 9:45 AM Toms Bergmanis via Corpora <corpora@list.elra.info> wrote:
Can’t agree more.
Toms
From: Rodolfo Delmonte via Corpora corpora@list.elra.info
Sent: Monday, August 21, 2023 10:06 AM
To: Ben Sir benoit.siroit@gmail.com
Cc: corpora corpora@list.elra.info
Subject: [Corpora-List] Re: RANLP 2023 Call for Participation
Fully agree with you Ben.
Rodolfo
On Mon, Aug 21, 2023, 01:00 Ben Sir via Corpora corpora@list.elra.info wrote:
Hi Ada,
It's understandable that enthusiasm can sometimes lead to excessive engagement, but your disruptive posting on the mailing list has reached an intolerable level. Please keep your conversations private instead of spamming everyone and curb your enthusiasm. Your obnoxious behavior reflects poorly on you.
Thanks.