Dear Johannes

Re "CreoleVal":
at this point, it is less a question of whether "one can" than a case of "one shouldn't".

The following is what I wrote to SIGTYP; I think the message would be similar for your initiative:

"""
---------- Forwarded message ---------
From: Ada Wan <adawan919@gmail.com>
Date: Tue, Oct 31, 2023 at 6:47 PM
Subject: Re: [Corpora-List] First CFP: The 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP 2024)
To: Michael Hahn <mhahn@lst.uni-saarland.de>, <sigtyp@gmail.com>
Cc: <corpora@list.elra.info>


Dear Michael, dear SIGTYP officers and workshop organizers

I saw this posting of yours and have some concerns re the orientation of this workshop/event. Given the work by Mielke et al. (2019) and Wan (2022), I am surprised that the workshop description does not seem to have been updated accordingly.
I have some questions:
i. would/could such events/initiatives contribute to misinforming academics, professionals, and practitioners (including those who may be new to the topic)?
ii. at what granularity (e.g. "word", character, or byte) will "linguistic typology" be promoted through this workshop/event?
iii. what is/are the "discipline-specific narrative(s)" (default expectations of a discipline), if any, that is/are supposed to still hold, esp. after the two publications mentioned above?
iv. how is "language" defined for the aim(s)/purpose(s) of your workshop? and
v. since the initiatives of the workshop are computing-related, is character encoding (an area that has been severely overlooked in the past in Computational Linguistics / Natural Language Processing) being used/promoted/introduced?

One major ethical consideration in the area of "linguistic typology" is that it could unnecessarily exacerbate differences between language varieties, esp. if/when such differences are not observable unless one creates them through "word" (or "word"-like) tokenization in the preprocessing step. It would be a violation of scientific integrity if one were to continue "word"-hacking (in another formulation: intentionally discarding data) in the name of "linguistic typology", would you not agree?

I look forward to your replies.

Thanks and best
Ada
"""

Thanks and best
Ada


On Tue, Oct 31, 2023 at 8:59 PM Johannes Bjerva via Corpora <corpora@list.elra.info> wrote:
We are proud to announce the release of CreoleVal - a collection of benchmarks for 28 Creole languages. The collection of datasets spans tasks such as relation classification, machine comprehension, machine translation, and named entity recognition, as well as use cases such as language modeling. We cover Haitian Creole, Bislama, Chavacano, Pitkern, Singlish, Tok Pisin, Papiamento, and others.

We hope the NLP community will include this collection of datasets in ongoing & future evaluations of methods directed at low-resource languages. Not only that, we also hypothesise that CreoleVal will open the door for controlled experimentation with transfer learning methodology.

This resource has been long in the making, and was made possible by a long list of collaborators. 

For a pre-print, see: https://arxiv.org/abs/2310.19567

For code and data, see: https://github.com/hclent/CreoleVal
(Repository under construction)
