In this newsletter:
LDC membership discounts expire March 1
30th Anniversary Highlight: Arabic Treebank
New publications:
2019 NIST Speaker Recognition Evaluation Test Set - Audio-Visual<https://catalog.ldc.upenn.edu/LDC2023V01>
LORELEI Tagalog Representative Language Pack<https://catalog.ldc.upenn.edu/LDC2023T02>
________________________________
LDC membership discounts expire March 1
Time is running out to save on 2023 membership fees. Renew your LDC membership, rejoin the Consortium, or become a new member by March 1 to receive a discount of up to 10%. For more information on membership benefits and options, visit Join LDC<https://www.ldc.upenn.edu/members/join-ldc>.
30th Anniversary Highlight: Arabic Treebank
The Penn/LDC Arabic Treebank (ATB) project began in 2001 with support from the DARPA TIDES program and, later, the DARPA GALE and BOLT programs. The original focus was on Modern Standard Arabic (MSA), which is not natively spoken and is not homogeneously acquired across its writing and reading community. In addition to the expected issues associated with complex data annotation, LDC encountered several challenges unique to a highly inflected language with a rich history of traditional grammar. LDC relied on traditional Arabic grammar, as well as established and modern grammatical theories of MSA -- in combination with the Penn Treebank approach to syntactic annotation -- to design an annotation system for Arabic (Maamouri et al., 2004<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/nemlar2004-penn-ara…>). LDC was innovative with respect to traditional grammar when necessary and when other syntactic approaches were found to account for the data. LDC also developed a wide-coverage MSA morphological analyzer, LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (LDC2010L01<https://catalog.ldc.upenn.edu/LDC2010L01>), which greatly benefited ATB development. Revisions to the annotation guidelines during the DARPA GALE program (principally related to tokenization and syntactic annotation) improved inter-annotator agreement and parsing scores.
ATB corpora were annotated for morphology, part-of-speech, gloss, and syntactic structure. Data sets based on MSA newswire developed under the revised annotation guidelines include Arabic Treebank: Part 1 v 4.1 (LDC2010T13<https://catalog.ldc.upenn.edu/LDC2010T13>), Arabic Treebank: Part 2 v 3.1 (LDC2011T09<https://catalog.ldc.upenn.edu/LDC2011T09>), and Arabic Treebank: Part 3 v 3.2 (LDC2010T08<https://catalog.ldc.upenn.edu/LDC2010T08>). Other genres are represented in Arabic Treebank - Broadcast News v 1.0 (LDC2012T07<https://catalog.ldc.upenn.edu/LDC2012T07>) and Arabic Treebank - Weblog (LDC2016T02<https://catalog.ldc.upenn.edu/LDC2016T02>).
LDC's later work on Egyptian Arabic treebanks in the DARPA BOLT program benefited from the strides in its MSA treebank annotation pipeline. As for the challenges presented by informal, dialectal material, collaborator Columbia University provided a normalized Arabic orthography to account for instances of Romanized script (Arabizi) in the data and developed a morphological analyzer (CALIMA) in parallel, working in a tight feedback loop with LDC's annotation team. SAMA and CALIMA were synchronized in the Egyptian Arabic treebanks, the former used for MSA tokens and the latter used for Egyptian Arabic tokens. Resulting corpora include BOLT Egyptian Arabic Treebank - Discussion Forum (LDC2018T23<https://catalog.ldc.upenn.edu/LDC2018T23>), Conversational Telephone Speech (LDC2021T12<https://catalog.ldc.upenn.edu/LDC2021T12>), and SMS/Chat (LDC2021T17<https://catalog.ldc.upenn.edu/LDC2021T17>).
ATB corpora and their related releases are available for licensing to LDC members and nonmembers. For more information about licensing LDC data, visit Obtaining Data<https://www.ldc.upenn.edu/language-resources/data/obtaining>.
________________________________
New publications:
2019 NIST Speaker Recognition Evaluation Test Set - Audio-Visual<https://catalog.ldc.upenn.edu/LDC2023V01> contains approximately 64 hours of English audio-visual data for development and test, answer keys, enrollment, trial files, and documentation from the NIST-sponsored 2019 Speaker Recognition Evaluation (SRE)<https://www.nist.gov/itl/iad/mig/nist-2019-speaker-recognition-evaluation>.
The 2019 evaluation task was speaker detection, that is, to determine whether a specified target speaker was speaking during a segment of speech. The evaluation was conducted in two parts: (1) a leaderboard-style challenge based on conversational telephone speech and (2) a separate evaluation using audio-visual data. This release relates to the audio-visual evaluation.
The source audio-visual data was collected by LDC for the VAST (Video Annotation for Speech Technology) project. That collection focused on amateur video recordings from various online media hosting services. The recordings vary in duration from 17.5 seconds to 13 minutes; most have two audio channels (stereo), but some are monophonic (one channel).
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
LORELEI Tagalog Representative Language Pack<https://catalog.ldc.upenn.edu/LDC2023T02> was developed by LDC and comprises approximately 4.8 million words of Tagalog monolingual text, 341,000 words of found Tagalog-English parallel text, and 124,000 Tagalog words translated from English data. Approximately 78,000 words were annotated for named entities and over 26,000 words were annotated for entity discovery and linking and situation frames (identifying entities, needs and issues). Data was collected from discussion forums, news, reference material, social networks, and weblogs.
The LORELEI (Low Resource Languages for Emergent Incidents) program was concerned with building human language technology for low resource languages in the context of emergent situations. Representative languages were selected to provide broad typological coverage.
The knowledge base for entity linking annotation is available separately as LORELEI Entity Detection and Linking Knowledge Base (LDC2020T10)<https://catalog.ldc.upenn.edu/LDC2020T10>.
2023 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc(a)ldc.upenn.edu<mailto:ldc@ldc.upenn.edu>
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Dear colleagues,
We are further extending the deadline for this call to the 31st of March. Please find below the updated call:
-------------
We invite researchers in the broad area of computational morphology to submit their recent, unpublished work to a special issue of the Journal of Language Modelling <https://jlm.ipipan.waw.pl/index.php/JLM>.
Motivation:
Computational techniques have a long history of use in the study of morphology, where they have been used both for practical tasks such as the analysis and production of complex word forms and for theoretical ones such as structural and informational analysis of morphological systems. As both systems and datasets improve, these techniques are increasingly developed and evaluated on a typologically diverse array of languages, including many which are endangered or lack large-scale resources. Detailed comparisons across languages can help to reveal typological biases or assumptions within existing computational techniques [1, 2]. Alternatively, computational methods and analyses can also shed light on questions within linguistic typology [3, 4, 5, 6].
The goal of this special issue is to bring researchers from multiple communities together in exploring issues of linguistic typology across a wide range of different languages and phenomena. We encourage the submission of work on endangered or less-studied languages.
The Journal of Language Modelling is a free (for readers and authors alike) open-access peer-reviewed journal. All articles are peer-reviewed by at least 3 reviewers, usually including at least one member of the Editorial Board.
Topics of interest:
- Typological clustering or classification of languages
- Investigation of particular linguistic features which improve or detract from the performance of computational morphology tools
- Comparison of morphological structures (e.g., inflection classes, implicative networks) across typologically different languages
- Investigation of diachronic typological change using computational methods
- Creation, curation or analysis of typological databases via computational methods
Submissions:
The submissions should be journal papers, not proceedings papers, totalling 25-50 pages, excluding references.
Authors are advised to use the online manuscript submission for the journal. Make sure to select the special issue when asked to provide the article type. More information, including formatting instructions for authors, can be found on the journal's webpage at: https://jlm.ipipan.waw.pl/index.php/JLM/about/submissions. An adaptation of the LaTeX template for Overleaf can be found at: https://fr.overleaf.com/latex/templates/template-for-journal-of-language-mo….
Important dates:
Call for papers issued: 15/7/2022
Submissions due: 15/1/2023 --- extended to 31/03/2023
Author notification: Spring 2023
Guest editors:
Sacha Beniamine (University of Surrey)
Micha Elsner (The Ohio State University)
Katharina Kann (University of Colorado, Boulder)
References
[1] Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016a. The SIGMORPHON 2016 shared Task— Morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.
[2] Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya McCarthy, and Katharina Kann. 2020. Unsupervised morphological paradigm completion. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6696–6707, Online. Association for Computational Linguistics.
[3] Neil Rathi, Michael Hahn, and Richard Futrell. 2021. An Information-Theoretic Characterization of Morphological Fusion. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10115–10120, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
[4] Parker, J., Reynolds, R., & Sims, A. (2022). Network Structure and Inflection Class Predictability: Modeling the Emergence of Marginal Detraction. In A. Sims, A. Ussishkin, J. Parker, & S. Wray (Eds.), Morphological Diversity and Linguistic Cognition (pp. 247-281). Cambridge: Cambridge University Press. DOI: 10.1017/9781108807951.010
[5] Guzmán Naranjo, Matías and Becker, Laura. Statistical bias control in typology. Linguistic Typology, to appear, 2021. DOI: 10.1515/lingty-2021-0002
[6] Sacha Beniamine. 2021. One lexeme, many classes: Inflection class systems as lattices. In Berthold Crysmann & Manfred Sailer (eds.), One-to-many relations in morphology, syntax, and semantics, 23--51. Berlin: Language Science Press. DOI: 10.5281/zenodo.4729789
Senior Research Associate at the Alan Turing Institute, Foundation Models and Commonsense Reasoning
Position
In 2022, the Alan Turing Institute signalled its intention to establish a portfolio of foundational AI research, which would complement the strengths of the institute around applications of AI and AI policy. An initial portfolio of research in foundation models, game theory, and probabilistic programming will be launched in early 2023. Each of these areas is called a ‘Pillar’. It is intended that this portfolio will complement the UK’s current activity, rather than duplicating existing efforts, and will aim to promote emerging new areas that show promise for the future.
FOUNDATION MODELS
Foundation models are large ML models trained on large, broad data sets. Foundation models such as GPT-3 have been shown to have remarkable capabilities for generating realistic natural language, and, to some extent, capabilities for problem solving and common-sense reasoning. Developing a Turing Foundation Model is beyond our present capacity. We therefore propose work aimed at developing Turing expertise around the problem of precisely understanding the capabilities of such models. The main issue we aim to address is that of *benchmarking* such models: although such models appear to be very capable in some respects, they fail on apparently simple tasks, in unpredictable ways. In short, we don't have a clear understanding of the capabilities and shortcomings of such systems - which raises concerns for their use.
ROLE PURPOSE
We are looking for a Senior Research Associate to support and enable the delivery of the Foundation Model theme, under the direction of Anthony (Tony) Cohn, and in collaboration with Michael Wooldridge and Nigel Shadbolt.
The successful candidate will primarily focus on evaluating the extent to which, and the conditions under which, existing Foundation Models can support common-sense reasoning (such as naive physics, spatial and temporal reasoning, the concept of agency, and causality). This work will be undertaken in parallel and in collaboration with researchers working on other aspects of the Foundation Models Pillar.
The candidate will join a vibrant team of researchers and will have opportunities to engage with cutting-edge projects and experts at leading universities.
The post can be based either at The Alan Turing Institute site in London, or at the University of Leeds. In either case you will need to travel to the other site when required (travel expenses will be paid as appropriate).
Further details and the online application form can be found here:
https://cezanneondemand.intervieweb.it/turing/jobs/senior-research-associat…
====
SEMANTiCS - 19th International Conference on Semantic Systems
Leipzig, Germany
Workshops and Tutorials
September 20 - 22, 2023
https://2023-eu.semantics.cc/page/cfp_ws
====
SEMANTiCS 2023 is a major venue for research and industrial innovation
and features a workshop and tutorial program addressing the diverse
practical interests of its audience. This program is intended to offer a
rich diversity of topics to conference attendees and local participants
seeking to pick up new skills and stay up-to-date regarding the latest
developments in the community. We encourage submissions of proposals on
all topics in the general areas of SEMANTiCS 2023 and proposals bridging
or introducing new perspectives in these areas. Workshops and tutorials
may incorporate panel discussions, lightning talks, meetings, networking
or hands-on sessions, hackathons and other practical formats where
applicable. Rooms for business or project meetings are available upon
request as well.
=Important Dates for Workshops=
* Proposals WS Deadline: March 07, 2023 (11:59 pm, Hawaii time)
* Notification of Acceptance: March 14, 2023 (11:59 pm, Hawaii time)
=Important Dates for Tutorials (and other meetings, e.g. seminars,
show-cases, etc., without call for papers)=
* Proposals Tutorial Deadline: June 06, 2023 (11:59 pm, Hawaii time)
* Notification of Acceptance: June 20, 2023 (11:59 pm, Hawaii time)
Submission via Easychair on https://easychair.org/conferences/?conf=sem23
=Scope & Goals=
Workshops and tutorials at SEMANTiCS 2023 allow your organisation or
project to advance and promote your topics and gain increased
visibility. The workshops and tutorials will be announced on the
SEMANTiCS website and they will be seen by all participants. SEMANTiCS
2023 workshops and tutorials can be incubators for industrial and
scientific communities that form and share a particular research and
development agenda. They provide a forum for presenting contributions
and findings to a diverse and knowledgeable community.
Furthermore, the event can be used as a dissemination activity in the
scope of large research projects or as a closed format for
research/commercial project consortia meetings.
=Setup and Requirements=
SEMANTiCS 2023 workshops and tutorials may be either half or full day
long. Workshops and tutorials take place on the days before and/or after
the main SEMANTiCS 2023 EU conference (20th, 21st, and/or 22nd of
September 2023). Details will be communicated on time.
Organizers of workshops and tutorials will be granted three free tickets
(only for the workshop & tutorial day) for organization purposes or
keynotes. Participants of workshops and tutorials will be charged a
marginal fee to cover the basic costs.
Workshop and tutorial proposals must include the following information:
* outline of the themes and goals of the event, including a title and a
brief abstract (less than 200 words) intended for the SEMANTiCS 2023 website
* a statement addressing why the event is important, why the event is
timely, and how it is relevant to SEMANTiCS 2023 and the field of semantic
web. For tutorials, also explain why the presenters are qualified to give a
high-quality introduction to the topic
* related workshops and conferences, i.e., specifying if this is a
continuation of a workshop series or is a new workshop to address an
emerging issue. Please provide information about past versions of this
workshop and other related workshops (including URLs and
submission/acceptance counts, if available).
* a statement addressing the quality assurance criterion that will be
used by the event organizers to select the papers for the workshops and
the presenters for the tutorials (e.g., peer review or review/evaluation
by event organizers). If a peer review process is chosen as a quality
assurance criterion for the workshops, the organizers will be
responsible for their own reviewing process. Workshop organizers will be
responsible also for their own publicity (e.g., website, timelines and
call for papers) and proceedings production.
* structure of the event and plans for generating and stimulating
discussion; how will the interaction be organized in case of a hybrid event
* desired minimum and maximum number of event participants, expected
number of participants, and (in case of previously held events) number
of registered attendees and web site for previous editions of the event
* a description of the intended audience and the expected learning outcomes
* desired prerequisite knowledge of the audience
* proposed duration of the event (i.e., half or full day), different
sessions if applicable (final time slot will be assigned in accordance
with the SEMANTiCS program)
* any equipment, room capacity, or other logistic constraints
* full contact information of all organizers of the event and main
contact person; a brief description of each organizer's background,
including relevant past experience in organizing events
Workshop and tutorial proposals must be submitted via
Easychair: https://easychair.org/my/conference?conf=sem23
=Review and Evaluation Criteria=
Workshop and tutorial proposals will be reviewed by the SEMANTiCS 2023
Workshop Chairs, as well as by the SEMANTiCS 2023 organizing committee,
according to the following criteria:
* The potential to advance the state of semantic web research and practice
* The quality assurance criterion proposed by the organizers to select
high-quality papers for workshops and presenters for tutorials
* The organizers' experience and ability to lead a successful event
* Timeliness and expected interest in the event topics
* The balance and synergy between all SEMANTiCS 2023 events
=Topics of interest include (but are not limited to)=
* Web Semantics & Linked (Open) Data
* Enterprise Knowledge Graphs, Graph Data Management and Deep Semantics
* Machine Learning & Deep Learning Techniques
* Semantic Information Management & Knowledge Integration
* Terminology, Thesaurus & Ontology Management
* Data Mining and Knowledge Discovery
* Reasoning, Rules and Policies
* Natural Language Processing and Computational Linguistics
* Social and Human aspects of Semantic Web
* Data Quality Management and Assurance
* Explainable Artificial Intelligence
* Semantics in Data Science
* Semantics of Blockchain & Distributed Ledger Technologies
* Trust, Data Privacy, and Security with Semantic Technologies
* Economics of Data, Data Services and Data Ecosystems
* Applications of Semantic Web technologies in domains such as law,
medicine, life sciences, digital humanities, mobility and smart cities, etc.
We especially invite contributions that illustrate the applicability of
the topics mentioned above for industrial purposes and/or illustrate the
business relevance of their contribution for specific industries.
Workshop proposals on emerging themes for the topics listed above are
encouraged.
In case you have additional questions concerning the submission process,
please do not hesitate to contact us via Easychair.
We are looking forward to your contribution!
Jennifer D’Souza - jennifer.dsouza(a)tib.eu
Anisa Rula - anisa.rula(a)unibs.it
Workshop & Tutorial Chairs
FYI.
---------- Forwarded message ---------
From: 'Archna Bhatia' via MWE Workshop 2023 Organizers <
mweworkshop2023(a)googlegroups.com>
Date: Wed, Feb 8, 2023 at 8:09 PM
Subject: Fwd: [Corpora-List] Deadline extension: 19th Workshop on Multiword
Expressions (MWE 2023)
To: MWE Workshop 2023 Organizers <mweworkshop2023(a)googlegroups.com>
I have no idea why my emails seem like they need to be moderated recently
(I must have sent something inappropriate, just kidding!), but even to my
response below I received a notification that it is awaiting moderation and
that either this would post or I would hear back of the moderator’s
decision. From the past two CFPs (small sample), my experience is that
posts awaiting moderation do not get posted nor do I hear of the
moderator’s decision. So could someone else forward my response?
thanks,
Archna
Begin forwarded message:
*From: *Archna Bhatia <abhatia(a)ihmc.org>
*Subject: **Re: [Corpora-List] Deadline extension: 19th Workshop on
Multiword Expressions (MWE 2023)*
*Date: *February 8, 2023 at 2:59:35 PM EST
*To: *Ada Wan <adawan919(a)gmail.com>
*Cc: *Ken Litkowski <ken(a)clres.com>, corpora(a)list.elra.info
Hi Ada,
While appropriate space is found for this discussion, let me respond to
just your first suggestion (for now): Why do you think they should be
renamed “fixed/idiomatic expressions”? What would your definition of
“fixed” and of “idiomatic” mean? How fixed would you say these expressions
would be? Is morphological variation allowed? Is variation in any of the
other linguistic aspects allowed? From my point of view, “fixed/idiomatic
expressions” results in a much more restricted category than everything we
consider could be treated as multiwords.
Thanks,
Archna
On Feb 8, 2023, at 2:38 PM, Ada Wan via Corpora <corpora(a)list.elra.info>
wrote:
Hi Ken
Thanks for the message. Unfortunately, it looks like there have been no
prior discussions on any of the topics I suggested, and the earliest post I
can access dates back only to 22Nov2020. I can surely start a discussion,
but that might look to be the first/only discussion on the list? (I went
through all the conversations accessible thus far and only saw
announcements.)
Perhaps more importantly:
as this seems to be an issue that could also affect other areas of concern
to the general audience of the Corpora-List (*not just for MWEs/SIGLEX*),
is there a way that we all can make some changes in the "language space"
across the board?
Thanks and best
Ada
On Wed, Feb 8, 2023 at 5:57 PM Ken Litkowski <ken(a)clres.com> wrote:
> Dear Ada,
>
> When I added the SIGLEX discussion code back in 2010, I did so with the
> idea that we would have discussions of topics just like yours. The
> current form of the discussion is located on the Google group, via
> https://groups.google.com/g/siglex-members.
> There, you will find a place "Search conversations ..." where you can add
> your topic so that it will be sent to all, rather than just the
> announcements that are mainly the topics there.
>
> Ken (webmaster retiree)
> On 2/8/2023 10:18 AM, Ada Wan via Corpora wrote:
>
> Hi Kilian
>
> Hope all has been well.
>
> I'm surprised that people are still "wording around" nowadays. Some
> suggestions:
>
> 1. Can't we rename "MWEs" to "fixed/idiomatic expressions" instead? One
> can reformulate these as sequences/strings/expressions of various
> lengths/vocabs in characters.
> 2. Also, one can interpret these without information/association with any
> syntactic categories, nouns or verbs etc.
> 3. They do just represent lexical info (some reflecting/encoding
> historico-social habits, though one also should be aware of the ethical
> aspects of reinforcing some "traditional values"). Perhaps a more
> sophisticated view of language could help wean practitioners from a
> mindframe that relies on "linguistic structure(s)" as we've had it thus far
> (i.e. based on "words" and "sentences")?
> 4. Re "their meaning often does not result from the direct combination of
> the meanings of their parts": non-compositionality may be a better
> description of a more realistic view of language, it should prob be our
> default expectation (instead of the cherry-picked compositional
> counterparts).
>
> I think efforts towards mitigating a mental dependency on "words" would be
> a good direction to pursue, what do you think?
> Can we get SIGLEX to update in this regard?
>
> Best
> Ada
>
>
> On Wed, Feb 8, 2023 at 11:12 AM Kilian Evang via Corpora <
> corpora(a)list.elra.info> wrote:
>
>> [Apologies for cross-postings]
>>
>>
>> ********************************************************************************
>>
>> Call for Papers: Deadline extended
>>
>> 19th Workshop on Multiword Expressions (MWE 2023)
>>
>> Organized and sponsored by SIGLEX, the Special Interest Group
>> on the Lexicon of the ACL
>>
>> Full-day workshop collocated with EACL 2023, Dubrovnik, Croatia, May 5
>> or 6, 2023
>>
>> Hybrid (on-site & on-line)
>>
>> NEW: Submission deadline: February 20, 2023
>>
>> NEW: Invited speakers announced (see below)
>>
>> NEW: Best paper award (see below)
>>
>> MWE 2023 website: https://multiword.org/mwe2023/
>>
>>
>> ********************************************************************************
>>
>> Multiword expressions (MWEs) are word combinations that exhibit
>> lexical, syntactic, semantic, pragmatic, and/or statistical
>> idiosyncrasies (Baldwin & Kim 2010), such as by and large, hot dog,
>> pay a visit and pull one's leg. The notion encompasses closely related
>> phenomena: idioms, compounds, light-verb constructions, phrasal verbs,
>> rhetorical figures, collocations, institutionalised phrases, etc.
>> Their behaviour is often unpredictable; for example, their meaning
>> often does not result from the direct combination of the meanings of
>> their parts. Given their irregular nature, MWEs often pose complex
>> problems in linguistic modelling (e.g. annotation), NLP tasks (e.g.
>> parsing), and end-user applications (e.g. natural language
>> understanding and MT), hence still representing an open issue for
>> computational linguistics (Constant et al. 2017).
>>
>> For almost two decades, modelling and processing MWEs for NLP has been
>> the topic of the MWE workshop organised by the MWE section of SIGLEX
>> in conjunction with major NLP conferences since 2003. Impressive
>> progress has been made in the field, but our understanding of MWEs
>> still requires much research considering their need and usefulness in
>> NLP applications. This is also relevant to domain-specific NLP
>> pipelines that need to tackle terminologies most often realised as
>> MWEs. Following previous years, for this 19th edition of the workshop,
>> we identified the following topics on which contributions are
>> particularly encouraged:
>>
>> MWE processing and identification in specialized languages and
>> domains: Multiword terminology extraction from domain-specific corpora
>> (Bonin et al. 2010) is of particular importance to various
>> applications, such as MT (Semmar & Laib, 2017), or for the
>> identification and monitoring of neologisms and technical jargon
>> (Chatzitheodorou et al., 2021). We expect approaches that deal with
>> the processing of MWEs as well as the processing of terminology in
>> specialised domains can benefit from each other.
>>
>> MWE processing to enhance end-user applications: MWEs have gained
>> particular attention in end-user applications, including MT (Zaninello
>> & Birch 2020; Han et al. 2021, 2022), simplification (Kochmar et al.
>> 2020), language learning and assessment (Paquot et al. 2019;
>> Christiansen & Arnon 2017), social media mining (Maisto et al. 2017),
>> and abusive language detection (Zampieri et al. 2020; Caselli et al.
>> 2020). We believe that it is crucial to extend and deepen these first
>> attempts to integrate and evaluate MWE technology in these and further
>> end-user applications.
>>
>> MWE identification and interpretation in pre-trained language models:
>> Most current MWE processing is limited to their identification and
>> detection using pre-trained language models, but we still lack
>> understanding about how MWEs are represented and dealt with therein
>> (Nedumpozhimana & Kelleher 2021; Garcia et al. 2021; Fakharian & Cook
>> 2021), and how to better model the compositionality of MWEs from semantics
>> (Moreau et al. 2018). Now that NLP has shifted towards end-to-end
>> neural models like BERT, capable of solving complex tasks with little
>> or no intermediary linguistic symbols, questions arise about the
>> extent to which MWEs should be implicitly or explicitly modelled
>> (Shwartz & Dagan, 2019).
>>
>> MWE processing in low-resource languages: The PARSEME shared tasks
>> (Ramisch et al. 2020; 2018; Savary et al. 2017), among others, have
>> fostered significant progress in MWE identification, providing
>> datasets that include low-resource languages, evaluation measures, and
>> tools that now allow fully integrating MWE identification into
>> end-user applications. A few efforts have recently explored methods
>> for the automatic interpretation of MWEs (Bhatia et al. 2018; 2017),
>> and their processing in low-resource languages (Liu & Wang 2020; Kumar
>> et al. 2017). Resource creation and sharing should be pursued in
>> parallel with the development of methods able to capitalize on small
>> datasets (Han et al. 2020).
>>
>> Through this workshop, we would like to bring together and encourage
>> researchers in various NLP subfields to submit MWE-related research,
>> so that approaches that deal with processing of MWEs including
>> processing for low-resource languages and for various applications can
>> benefit from each other. We also intend to consolidate the converging
>> effects of previous joint workshops LAW-MWE-CxG 2018, MWE-WN 2019 and
>> MWE-LEX 2020, the joint MWE-WOAH panel in 2021, and the MWE-SIGUL 2022
>> joint session, extending our scope to MWEs in e-lexicons and WordNets,
>> MWE annotation, as well as grammatical constructions. Correspondingly,
>> we call for papers on research related (but not limited) to MWEs and
>> constructions in:
>>
>> Computationally-applicable theoretical work in psycholinguistics and
>> corpus linguistics;
>>
>> Annotation (expert, crowdsourcing, automatic) and representation in
>> resources such as corpora, treebanks, e-lexicons, and WordNets (also
>> for low-resource languages);
>>
>> Processing in syntactic and semantic frameworks (e.g. CCG, CxG, HPSG,
>> LFG, TAG, UD, etc.);
>>
>> Discovery and identification methods, including for specialized
>> languages and domains such as clinical or biomedical NLP;
>>
>> Interpretation of MWEs and understanding of text containing them;
>>
>> Language acquisition, language learning, and non-standard language
>> (e.g. tweets, speech);
>>
>> Evaluation of annotation and processing techniques;
>>
>> Retrospective comparative analyses from the PARSEME shared tasks;
>>
>> Processing for end-user applications (e.g. MT, NLU, summarisation,
>> language learning, etc.);
>>
>> Implicit and explicit representation in pre-trained language models
>> and end-user applications;
>>
>> Evaluation and probing of pre-trained language models;
>>
>> Resources and tools (e.g. lexicons, identifiers) and their integration
>> into end-user applications;
>>
>> Multiword terminology extraction;
>>
>> Adaptation and transfer of annotations and related resources to new
>> languages and domains including low-resource ones.
>>
>>
>> Shared Task
>>
>> We do not have a shared task this year, but a new release of the
>> PARSEME corpus of verbal MWEs is currently underway. We encourage
>> submission of research papers that include analyses of the new edition
>> of the PARSEME data and improvements over the results for PARSEME 2020
>> shared task as well as SemEval 2022 task 2 on idiomaticity prediction.
>>
>>
>> *** Special Track on MWEs in Clinical NLP ***
>>
>> Pursuing the MWE Section’s tradition of synergies with other
>> communities, this year, we are organizing a joint session with the
>> Clinical NLP workshop for shared papers/poster presentations. Since
>> clinical texts contain a substantial number of multiword expressions
>> (e.g. medical terms or domain-specific collocations), a joint session
>> is deemed beneficial for both communities. The goal is to foster
>> future synergies that could address scientific challenges in the
>> creation of resources, models and applications to deal with multiword
>> expressions and related phenomena in the specialised domain of
>> Clinical NLP. Submissions describing research on MWEs in the
>> specialized domain of Clinical NLP, especially introducing new datasets
>> or new tools and resources, are welcome. Papers accepted in this track
>> will have the option to present their work in the Clinical NLP
>> workshop at ACL 2023 as well, after being presented at MWE 2023.
>>
>>
>> Invited Speakers
>>
>> We are looking forward to invited talks by two amazing speakers:
>>
>> Leo Wanner, Universitat Pompeu Fabra
>>
>> TBD
>>
>>
>> Best paper award
>>
>> All full papers in the workshop will be considered by the program
>> committee for a best paper award. The decision will be announced in
>> the closing session.
>>
>>
>> Submission formats
>>
>> The workshop invites two types of submissions:
>>
>> archival submissions that present substantially original research in
>> both long paper format (8 pages + references) and short paper format
>> (4 pages + references).
>>
>> non-archival submissions of abstracts describing relevant research
>> presented/published elsewhere which will not be included in the MWE
>> proceedings.
>>
>>
>> Paper submission and templates
>>
>> Papers should be submitted via the workshop's START submission page
>> (https://softconf.com/eacl2023/mwe2023/). Please choose the
>> appropriate submission format (archival/non-archival). Archival papers
>> with existing reviews will also be accepted through the ACL Rolling
>> Review. Submissions must follow the ACL 2023 stylesheet.
>>
>>
>> Archival papers with existing reviews from ACL Rolling Review will
>> also be considered. A paper may not be simultaneously under review
>> through ARR and MWE. A paper that has received or will receive reviews
>> through ARR may not be submitted for review to MWE.
>>
>>
>> Important Dates
>>
>> Paper submission: February 20, 2023
>>
>> ARR paper commitment: March 6, 2023
>>
>> Notification of acceptance: March 13, 2023
>>
>> Camera-ready papers due: March 27, 2023
>>
>> Workshop: May 5 or 6, 2023
>>
>>
>> All deadlines are at 23:59 UTC-12 (Anywhere on Earth).
>>
>>
>> Organizing Committee
>>
>> Program chairs: Marcos Garcia, Voula Giouli, Lifeng Han, Shiva Taslimipoor
>>
>> Publication chair: Archna Bhatia
>>
>> Publicity chair: Kilian Evang
>>
>>
>> Anti-harassment policy
>>
>> The workshop follows the ACL anti-harassment policy.
>>
>>
>> Contact
>>
>> For any inquiries regarding the workshop, please send an email to the
>> Organizing Committee at mweworkshop2023(a)googlegroups.com.
>> _______________________________________________
>> Corpora mailing list -- corpora(a)list.elra.info
>> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
>> To unsubscribe send an email to corpora-leave(a)list.elra.info
>>
>
> --
> Ken Litkowski TEL.: 301-482-0237
> CL Research EMAIL: ken(a)clres.com
> 9208 Gue Road Home Page: http://www.clres.com
> Damascus, MD 20872-1025 USA Blog: http://www.clres.com/blog
>
--
Archna Bhatia, Ph.D.
Research Scientist, Institute for Human & Machine Cognition
15 SE Osceola Ave, Ocala, FL 34471
(352) 387-3061
--
*News*:
*CFPs and participants*: HealTAC23
<http://healtex.org/healtac-conference-series/> (Manchester June 14-16) |
MWE23 <https://multiword.org/mwe2023/>@EACL (joint w ClinicalNLP@ACL)
*our work:*
ClinicalMT
<https://scholar.google.com/citations?view_op=view_citation&hl=en&user=_vf3E…>@WMT22_w_EMNLP
| Meta-eval Tutorial
<https://scholar.google.com/scholar?oi=bibs&hl=en&cluster=5418617987092956707>
/
HumanEval <https://aclanthology.org/2022.lrec-1.2.pdf> (paper_w_tool) /
TranslationUncertainty <https://aclanthology.org/2022.lrec-1.2.pdf> (paper)
@LREC22 | ClinicalTextMining
<https://github.com/poethan/TransformerCRF> (ML-Tools)
@HealTAC2022 *|* Covid-Topic-Modeling <https://arxiv.org/abs/2301.03029> (
*arXiv-2023*)
Serving as ACL2023 <https://2023.aclweb.org> AC (area chair): resource and
evaluation |
MWE-SIGLEX <https://multiword.org> elected Standing Committee Board member
(2022-2024) |
Ph.D. in Computer Application (Machine Translation, thesis
<https://doras.dcu.ie/26559/>), M.Sc. (Software Engineering, thesis
<https://arxiv.org/abs/1703.08748> *excellent-award*), B.Sc. (Math, *GPA
80/100*)
Google-Scholar <https://scholar.google.nl/citations?user=_vf3E2QAAAAJ&hl=en> ,
Presentation <https://www.slideshare.net/AaronHanLiFeng>(ppt),
Research-Gate <https://www.researchgate.net/profile/Aaron_L-F_Han>
Google-site <https://sites.google.com/view/poetgarden/home> Linkedin
<https://www.linkedin.com/in/aaronhan/>, Writer
<https://books.apple.com/us/author/lifeng-han/id1602229739>(poetry)
Postdoctoral <https://www.research.manchester.ac.uk/portal/lifeng.han.html>
Research Associate at HECTA
<https://www.research.manchester.ac.uk/portal/en/researchers/goran-nenadic(4…>
group, The University of Manchester, UK
https://www.research.manchester.ac.uk/portal/lifeng.han.html Former: ADAPT
Research Centre & DCU, Ireland
CALL FOR PARTICIPATION
IberLEF 2023 Task - HOPE: Multilingual Hope Speech detection
Held as part of the evaluation forum IberLEF 2023
<https://sites.google.com/view/iberlef-2023> in the XXXIX edition of the
International Conference of the Spanish Society for Natural Language
Processing (SEPLN 2023 <http://sepln2023.sepln.org/en/home/>)
September 26, 2023. Jaén, Andalusia, Spain
Codalab link: https://codalab.lisn.upsaclay.fr/competitions/10215
Dear All,
We invite researchers and students to participate in the shared task HOPE:
Multilingual Hope Speech detection, held as part of IberLEF 2023, the
shared evaluation campaign for Natural Language Processing systems in
Spanish and other Iberian languages, co-located with the SEPLN 2023 conference.
The HOPE shared task is related to the inclusion of vulnerable groups and
focuses on the study of the detection of hope speech, in pursuit of
equality, diversity and inclusion. This task was previously organized at
the second workshop on Language Technology for Equality, Diversity and
Inclusion (LT-EDI-2022), as a part of ACL 2022, but for five languages:
Tamil, Malayalam, Kannada, English and Spanish. The novelties of this
shared task are: i) it is organized in two languages, Spanish and English;
and ii) it provides an expanded and improved dataset. It consists of two
subtasks:
- Subtask 1: Hope Speech detection in Spanish. Given a Spanish tweet,
  identify whether it contains hope speech or not. The possible categories
  for each text are:
  - HS: Hope Speech.
  - NHS: Non Hope Speech.
- Subtask 2: Hope Speech detection in English. Given an English YouTube
  comment, identify whether it contains hope speech or not. The possible
  categories for each text are:
  - HS: Hope Speech.
  - NHS: Non Hope Speech.
In both subtasks there will be a real-time leaderboard and the participants
will be allowed to make a maximum of 10 submissions through CodaLab, from
which each team will have to select the best one for ranking.
The dataset for this task comprises two corpora, one in Spanish and another
in English. The Spanish corpus was collected between 2021 and 2022. It is
an extension of the SpanishHopeEDI dataset (García-Baena et al., 2023), to
be published in the journal Language Resources and Evaluation, which was
used in the ACL LT-EDI-2022 Spanish task (Chakravarthi et al., 2022). It
consists of a set of LGBT-related tweets annotated as HS (Hope Speech) or
NHS (Non Hope Speech). A tweet is considered as HS if the text: i)
explicitly supports the social integration of minorities; ii) is a positive
inspiration for the LGTBI community; iii) explicitly encourages LGTBI
people who might find themselves in a situation; or iv) unconditionally
promotes tolerance. On the contrary, a tweet is marked as NHS if the text:
i) expresses negative sentiment towards the LGTBI community; ii) explicitly
seeks violence; or iii) uses gender-based insults. The English corpus is an
extension of the English part of the HopeEDI dataset (Chakravarthi, 2020).
It consists of comments posted on YouTube videos on a wide range of
socially relevant topics such as Equality, Diversity and Inclusion,
including LGBTIQ issues, COVID-19, women in STEM, Black Lives Matter, etc.
To download the data and participate, go to:
https://codalab.lisn.upsaclay.fr/competitions/10215.
Best regards,
The HOPE 2023 organizing committee
References
- García-Baena, D., García-Cumbreras, M. A., Jiménez-Zafra, S. M.,
  García-Díaz, J. A., & Valencia-García, R. (2023). Hope Speech Detection in
  Spanish. The LGBT case. Language Resources and Evaluation. To be published.
- Chakravarthi, B. R. (2020). HopeEDI: A multilingual hope speech detection
  dataset for equality, diversity, and inclusion. In Proceedings of the
  Third Workshop on Computational Modeling of People’s Opinions, Personality,
  and Emotion’s in Social Media (pp. 41–53). Association for Computational
  Linguistics, Barcelona, Spain (Online).
  https://aclanthology.org/2020.peoples-1.5
- Chakravarthi, B. R., Muralidaran, V., Priyadharshini, R., Cn, S.,
  McCrae, J. P., García-Cumbreras, M. Á., Jiménez-Zafra, S. M.,
  Valencia-García, R., Kumar Kumaresan, P., Ponnusamy, R., García-Baena, D., &
  García-Díaz, J. (2022, May). Overview of the Shared Task on Hope Speech
  Detection for Equality, Diversity, and Inclusion. In Proceedings of the
  Second Workshop on Language Technology for Equality, Diversity and
  Inclusion (pp. 378-388). https://aclanthology.org/2022.ltedi-1.58
Important dates
- Release of training + development corpora: Feb 13, 2023.
- Release of test corpora and start of evaluation campaign: Mar 13, 2023.
- End of evaluation campaign (deadline for runs submission): Mar 28, 2023.
- Publication of official results: Mar 30, 2023.
- Paper submission: Apr 25, 2023.
- Review notification: May 23, 2023.
- Camera ready submission: Jun 9, 2023.
- IberLEF Workshop (SEPLN 2023): Sep 26, 2023 (Jaén, Andalusia, Spain).
- Publication of proceedings: Sep ??, 2023.
Organizing committee
- Miguel Ángel García Cumbreras (SINAI, Universidad de Jaén)
- Daniel García-Baena (SINAI, Universidad de Jaén)
- Bharathi Raja Chakravarthi (University of Galway)
- Salud María Jiménez-Zafra (SINAI, Universidad de Jaén)
- José Antonio García-Díaz (UMUTeam, Universidad de Murcia)
- Rafael Valencia-García (UMUTeam, Universidad de Murcia)
- L. Alfonso Ureña-López (SINAI, Universidad de Jaén)
*Salud María Jiménez Zafra*
sjzafra(a)ujaen.es
Universidad de Jaén
Grupo de Investigación SINAI <http://sinai.ujaen.es/> | Departamento de
Informática
EPS Jaén, Edificio A3, Despacho 219
Campus Las Lagunillas s/n 23071 - Jaén | +34 953212992
[Apologies for cross-posting]
*************************
Call for Participation
*************************
Task: *Homotransphobia Detection in Italian (HODI)* at EVALITA 2023
<https://www.evalita.it/campaigns/evalita-2023/>
Info: https://hodi-evalita.github.io/
Final Workshop: 7th - 8th September 2023, Parma, Italy
*Registration is required to obtain data and participate in the shared
task.*
To register, follow the instructions here:
https://hodi-evalita.github.io/how_to_participate/
-----------------------------------------------------
🌈 The HODI Shared Task 🌈
-----------------------------------------------------
We invite researchers to participate in the first shared task on
homotransphobia detection in Italian (HODI). Despite the NLP community’s
interest in hate speech detection datasets and models, very few studies
have covered homotransphobia. This is a concern, due to the target-oriented
nature of hate speech: recent studies have revealed that hate speech
detection methods cannot be applied across multiple kinds of hate speech targets.
HODI is organized according to two main subtasks:
*Subtask A - Homotransphobia detection:* the objective is to detect if a
text is homotransphobic or not.
*Subtask B - Explainability:* the objective is to extract the rationales
of the classification models trained for Subtask A.
Further details on the task, data, and evaluation are available at the task
website: https://hodi-evalita.github.io/
-----------------------
Important Dates
-----------------------
- 7 Feb 2023: Training data available (training period starts)
- 2 May 2023: Test data available
- 9 May 2023: System results due to organizers
- 30 May 2023: Results notification to participants
- 14 Jun 2023: Technical report due to organizers
- 10 Jul 2023: Reviews to participants (peer-reviews)
- 25 Jul 2023: Camera ready due to organizers
- 7 - 8 Sep 2023: EVALITA Workshop
----------------
Organizers
----------------
Debora Nozza, Bocconi University
Greta Damo, Bocconi University
Alessandra Teresa Cignarella, University of Turin
Tommaso Caselli, University of Groningen
Viviana Patti, University of Turin
--
Tommaso Caselli, Ph.D.
Senior Assistant Professor in Computational Semantics
Faculty of Arts, Rijksuniversiteit Groningen
The Netherlands
----------------------------
https://xs4all.academia.edu/TommasoCaselli
https://www.researchgate.net/profile/Tommaso_Caselli
Twitter: @tommaso_caselli
IberLEF 2023 Task ClinAIS: Automatic Identification of Sections in Clinical Documents
Website: https://ixa2.si.ehu.eus/clinais/
ClinAIS will be organized as part of IberLEF 2023, at the SEPLN 2023 Conference (Jaén, September 2023).
The ClinAIS task at IberLEF 2023 addresses the automatic identification of sections in unstructured Spanish clinical documents. The task focuses on identifying 7 predefined medical sections (Present Illness, Derived from/to, Past Medical History, Family History, Exploration, Treatment, and Evolution) in ECNs, mainly progress notes.
Solving this task will improve higher-level applications that extract valuable, actionable information from clinical documents, such as medical entity recognition, patient cohort retrieval, and temporal relation extraction. This will ultimately improve patient care and clinical decision-making.
IMPORTANT DATES
- March 2023 Release of Train + Dev Sets and Evaluation Library
- April 2023 Release of Test and Background Set
- May 2023 Submission of Results.
- May 2023 System Paper Submission Deadline.
- June 2023 Notification to Authors.
- June 2023 Camera Ready Submission Deadline.
- September 2023 Publication of Proceedings.
- September 2023 IberLEF within SEPLN 2023.
ORGANIZERS
IOMED
María Vivó, NLP Data Scientist
Paula Chocrón, NLP Engineer & Researcher
Gabriel de Maeztu, Co-founder & CTO at IOMED
HiTZ Center
Iker de la Iglesia, Research Scientist
Aitziber Atutxa, Professor at UPV/EHU
Koldo Gojenola, Professor at UPV/EHU
Esther Miranda, Technical Staff
Contact: ixa.iomed-clinais(a)ehu.es
Registration: https://ixa2.si.ehu.eus/clinais/registration
Dear colleagues,
The Field Matters 2023 workshop has extended its submission deadline: we accept papers until February 23.
The Second Workshop on NLP Applications to Field Linguistics (Field Matters 2023) will take place at EACL 2023 (https://2023.eacl.org/) in Dubrovnik, Croatia on May 5 or 6 (online participants are also welcome).
We accept papers on the following topics:
- Application of NLP to field linguistics workflow;
- Transfer learning for under-resourced language processing;
- The use of fieldwork data to build NLP systems;
- Modeling morphology and syntax of typologically diverse languages in the low-resource setting;
- Speech processing for under-resourced languages;
- Computational analysis of field linguistics datasets;
- Using technology for preserving culture via language;
- Improving ways of interaction with Indigenous communities;
- Machine-readable field linguistic datasets.
You can find more information on the submission process and format requirements on our website: https://field-matters.github.io/cfp2023
Follow us on Twitter for updates: https://twitter.com/field_matters
Best regards,
Anna Postnikova
Field Matters workshop organizing committee
Applications are invited for a 1-year research scholarship in English Linguistics at the Department of Foreign Languages and Literatures, University of Verona, within the project “Interconnected Nord-Est Innovation Ecosystem (iNEST)”, financed by NextGenerationEU in the context of the National Recovery and Resilience Plan (NRRP).
The project entails the compilation and analysis of a corpus of English-language texts promoting and describing prominent urban and extra-urban destinations in the Veneto region of Italy, as well as the creation of a set of guidelines for tourism promotion. Applicants are expected to have experience in corpus linguistics, tourism discourse, and discourse analysis, among other areas.
The closing date for applications is March 6th.
For more information about the research project, requirements and application process, you can find the complete call at this link:
https://www.univr.it/en/job-vacancies/assegnisti-di-ricerca/assegni-di-rice…
Best wishes,
Valeria Franceschi
----------------------------------------
Valeria Franceschi
Temporary Assistant Professor - English Language and Translation (L-LIN/12)
Department of Foreign Languages and Literatures
University of Verona