Dear all,
We are happy to release six corpora (1.3 million tokens) with full morphological annotations for the Palestinian, Lebanese, Yemeni, Iraqi, Libyan, and Sudanese dialects. All are annotated using the LDC's SAMA tag sets.
Search: https://portal.sina.birzeit.edu/curras
Download: https://portal.sina.birzeit.edu/curras/about-en.html
This video demonstrates how to search the corpora in Arabic/English.
https://twitter.com/mjarrar/status/1604078695068598273
#arabic_language_day We are very happy to release 6 Arabic dialect corpora (1.3 million tokens, morphologically annotated): Curras (Palestinian), Baladi (Lebanese), Lisani (Yemeni, Iraqi, Libyan, Sudanese) by @UN, @BirzeitU and @AUB_Lebanon. https://t.co/ZP3hqVSRWc

Best
--Mustafa
__________________________
Mustafa Jarrar, PhD
Professor of Artificial Intelligence
Chair, PhD Program in Computer Science
Birzeit University, Palestine
WhatsApp: +972599662258 | mjarrar(a)birzeit.edu
http://www.jarrar.info
Dear all --
In celebration of Arabic Language Day (Dec 18), we are happy to announce
the first release of Maknuune, the Open Source Palestinian Arabic Lexicon.
www.palestine-lexicon.org
Maknuune has over 36K entries covering 17K lemmas and 3.7K roots. All entries
include diacritized Arabic orthography, phonological transcription, and
English glosses. Some entries are enriched with additional information such
as broken plurals and templatic feminine forms, associated phrases and
collocations, Standard Arabic glosses, and examples or notes on grammar,
usage, or the location where the entry was collected.
We are honored to have received comments of endorsement from Profs. Noam
Chomsky, Hamid Dabashi, Abdelkader Fassi Fehri, Clive Holes, Ilan Pappe,
and Dr. Walid Saif.
https://sites.google.com/nyu.edu/palestine-lexicon/endorsements
--
Nizar Habash
Professor of Computer Science
New York University Abu Dhabi
A fully funded 4-year PhD position at the intersection of NLP and Topology
is offered at Queen Mary University of London (QMUL), School of Electronic
Engineering and Computer Science. It is part of the collaboration scheme
between QMUL and the China Scholarship Council (CSC), and is therefore
available to Chinese candidates only.
The CSC scheme provides a full tuition fee waiver and a living stipend for 4
years, and requires (among other things) an English language test (IELTS)
taken within the last 2 years. You can read more about the scheme's
requirements here:
<https://www.qmul.ac.uk/scholarships/items/china-scholarship-council-scholar…>
I am looking for brilliant candidates who hold (or are about to hold) an MSc
in Computer Science with a strong NLP research background. Prospective
students can learn more about the project here
<http://eecs.qmul.ac.uk/phd/phd-studentships/csc-phd-studentships-in-electro…>,
under the section: *Understanding neural representations via their
algebraic-topological structures*. The PhD student will work in an
interdisciplinary environment and will be at the forefront of NLP research.
If you are interested, please get in touch with me at
h.dubossarsky(a)qmul.ac.uk.
Best,
Haim
We kindly invite you to participate in LongEval 2023, a shared task on longitudinal evaluation of NLP models at CLEF 2023.
CALL FOR CONTRIBUTION
LongEval @ CLEF 2023
Longitudinal Evaluation of Model Performance
https://clef-longeval.github.io
Lab description:
LongEval aims to identify the types of models that offer better temporal persistence for NLP tasks on data that evolves over time, across both shorter and longer periods. LongEval is built on a common framework for its Information Retrieval and Text Classification tasks: for each system, we evaluate its performance when operating on test data acquired at the same time t as the training data, when operating on data acquired shortly after time t (sub-task 1), and when operating on data acquired at a later time t' (long after time t, sub-task 2). For each sub-task of each task, two evaluation measures are proposed: an absolute quality measure, and the relative drop compared to the system's result on the initial time-t test set.
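The relative-drop measure described above can be sketched as follows (function and variable names are my own; the lab defines the official scoring procedure):

```python
# Sketch of the two evaluation measures described above: an absolute quality
# score per test set, and the relative drop versus the initial time-t result.
def relative_drop(score_t: float, score_later: float) -> float:
    """Relative performance drop between the time-t test set and a later one."""
    return (score_t - score_later) / score_t

# A system scoring 0.80 at time t and 0.72 on a later test set
# shows a 10% relative drop.
drop = relative_drop(0.80, 0.72)
```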
LongEval 2023 proposes two tasks:
• Task 1: Information Retrieval. For this task, the data is a sequence of Web document collections and queries, each containing a few million documents and hundreds of queries, provided by Qwant. Relevance assessments are to be computed using a Click Model acquired from real users of the Qwant search engine. As the initial corpus contains only French documents, an automatic translation into English will be provided.
• Task 2: Text Classification. For this task, the training data is the TM-Senti sentiment analysis dataset extended with a development set and three human-annotated novel test sets for submission evaluation. TM-Senti is a general large-scale Twitter sentiment dataset in the English language, spanning over a 9-year period from 2013 to 2021. Tweets are labeled for sentiment as either “positive” or “negative”. The annotation is performed using distant supervision based on a manually curated list of emojis and emoticons.
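The distant-supervision scheme described for TM-Senti can be sketched as below; the emoji and emoticon entries are illustrative stand-ins, not the actual manually curated TM-Senti lexicon:

```python
# Minimal sketch of distant supervision: label tweets "positive" or
# "negative" from hits against a curated emoji/emoticon lexicon.
# These lexicon entries are illustrative, not the real TM-Senti list.
POSITIVE = {"😊", "😍", ":)", ":-)"}
NEGATIVE = {"😢", "😠", ":(", ":-("}

def distant_label(tweet):
    """Return 'positive', 'negative', or None if there is no clear signal."""
    pos = any(mark in tweet for mark in POSITIVE)
    neg = any(mark in tweet for mark in NEGATIVE)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # no hit, or conflicting hits: discard from the training set
```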
You can register for the task at: https://clef2023-labs-registration.dei.unipd.it/
Lab Organizers:
Rabab Alkhalifa, Iman Bilal, Hsuvas Borkakoty, Jose Camacho-Collados, Romain Deveaud, Alaa El-Ebshihy, Luis Espinosa-Anke, Gabriela Gonzalez-Saez, Petra Galuscakova, Lorraine Goeuriot, Elena Kochkina, Maria Liakata, Daniel Loureiro, Harish Tayyar Madabushi, Philippe Mulhem, Florina Piroi, Martin Popel, Christophe Servan, and Arkaitz Zubiaga.
Important dates:
Release of training data: 03/01/2023
Release of test data: 30/04/2023
Runs submissions date: 30/06/2023
LongEval Workshop: during CLEF 2023, Thessaloniki, 18-21 September 2023.
Workshop on Language-Based AI Agent Interaction with Children
https://aichildinteraction.github.io/
February 21st, 2023, in Los Angeles, USA & Virtual (Hybrid Format)
Paper Submission Deadline: January 13th, 2023 (extended)
Easychair: https://easychair.org/conferences/?conf=aiaic23
Contact: https://groups.google.com/g/ai-child-interactions or
aichildinteraction(a)gmail.com
===================================================
In this workshop, we aim to bring together researchers looking into
multimodal interactions between children and artificial agents to
discuss research problems that center around interactivity and go beyond
just processing child speech. We are interested in discussing approaches
to collecting and annotating datasets involving child speech, intent
classification in child speech, designing dialogue flow with artificial
agents that primarily interact with children, as well as repair
strategies, active listening behavior, and other aspects of dialogue
modeling. Moreover, multiparty conversations involving several children,
children and their adult caregivers, or several artificial agents are of
particular interest to this workshop.
Acknowledging the early-stage nature of research in this area, the
workshop will invite short position papers as contributions. In addition
to selected talks that will be invited based on the submitted papers, we
will host roundtable discussions allowing attendees to discuss ideas,
share challenges they have faced, and highlight ideas for future research.
## Topics of Interest
The workshop welcomes contributions across a wide range of topics
including, but not limited to:
Natural Language Understanding of Child Speech
Dialogue Modeling of Child-Agent, Child-Robot, Child-Child, and
Child-Adult Speech
Conversational Flow and Repair in Dialogue with Children
Multiparty-Interaction Involving Children
Multimodal Processing of Child Interactions
Automatic Speech Recognition of Child Speech
Evaluating Child Interactions with Artificial Agents/Robots
Challenges in Designing Interactions for Children
Datasets of Child-Child, Child-Adult, or Child-Agent/Robot Interaction
Ethics and Responsible AI for Child-Agent/Robot Interaction
Related Topics
## Important Dates
- Paper submission deadline: January 13th, 2023 (extended)
- Author Notification deadline: February 1st, 2023
- Workshop: February 21st, 2023 (morning session)
## Submission Guidelines
We invite short position papers of 3-4 pages (plus additional pages for
references and appendices, with no page limit), including work in
progress containing preliminary results, technical reports, case
studies, surveys, and state-of-the-art research in language-based AI
agent interactions with children. Recently submitted or published papers
are welcome to be submitted to this workshop if they are highly relevant
to the topic of the workshop. Please select the appropriate track during
the EasyChair submission to mark the submission accordingly.
Papers will be reviewed for their relevance, novelty, and scientific and
technical soundness.
Submissions do not need to be anonymized for review. All manuscripts
must be written in English and submitted electronically in PDF format
via EasyChair: https://easychair.org/conferences/?conf=aiaic23
Accepted papers will be published on the workshop website. However,
papers are still considered non-archival and can be submitted to other
conferences.
Authors of accepted papers are expected to present their paper during
the workshop in the form of a short talk, which can either be given in
person in Los Angeles, USA, or virtually via Zoom.
Authors should use the official IWSDS template:
Latex Style and Template:
https://drive.google.com/open?id=1mnzjvTlIVEsdPb2IZXbzxU8WRJj3mLiJ
Overleaf: https://www.overleaf.com/read/djcrwzgrdjvj
Word Template:
https://drive.google.com/open?id=1WmO9iLvJtO0cH1E0VSC1bPsC0vRDpzbd
## Contact
If you have questions, please get in touch via our public Google Group
https://groups.google.com/g/ai-child-interactions or by sending an
e-mail to aichildinteraction(a)gmail.com
Dear All,
CNRS offers one permanent research position (research fellow) in computer science for under-resourced languages.
https://gestionoffres.dsi.cnrs.fr/fo/offres/detail-en.php?&offre_id=18
Details about the selection process and the kinds of positions offered by CNRS are presented here:
https://www.cnrs.fr/en/competitive-entrance-examinations-researchers-womenm…. Note that these are permanent (non-tenured) positions, after a probationary period of one year.
Lattice is a lab in Paris (https://lattice.cnrs.fr/, funded by CNRS, Ecole normale supérieure-PSL and Université Sorbonne nouvelle) with a strong team both in linguistics and in natural language processing. Lattice would be happy to host the above position (candidates are free to mention in their application the lab that best suits their research projects among CNRS labs in linguistics with a strong NLP component).
Candidates with a strong CV who would like to apply at Lattice for this position can contact Sophie Prévost or Thierry Poibeau (both firstname.lastname(a)ens.psl.eu) with a CV and a research proposal. A record of international publications (including major computer science conferences) is mandatory.
Best regards,
Thierry Poibeau
Dear colleagues,
We are pleased to share information about The 3rd Workshop on Financial
Technology on the Web (FinWeb) with you. FinWeb-2023 will be held from April
30 to May 4, 2023, in conjunction with The Web Conference 2023. We invite the
submission of papers on original research in this area. A USD 500 prize will
be awarded to the Best Paper Award winner in the main track. In addition to
the main track, there is a shared task co-located with FinWeb-2023: the
Exaggerated Numeral Detection (ExNum) shared task.
*Submission Deadline: Feb. 06, 2023*
The proceedings of the workshops will be published jointly with the
conference proceedings.
Please refer to the site of FinWeb-2023 for more details:
https://sites.google.com/nlg.csie.ntu.edu.tw/finweb-2023/home
Sincerely,
Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura, Hsin-Hsi Chen
FinWeb-2023 Organizers
---
陳重吉 (Chung-Chi Chen), Ph.D.
Researcher
Artificial Intelligence Research Center, National Institute of Advanced
Industrial Science and Technology, Japan
E-mail: c.c.chen(a)acm.org <cjchen(a)nlg.csie.ntu.edu.tw>
Website: http://cjchen.nlpfin.com
The Centre for English Corpus Linguistics (CECL) is organizing the fifth edition of its Learner Corpus Research Summer School, which will take place at the University of Louvain, Belgium, from 3 to 7 July 2023.
The aim of the Summer School is to introduce learner corpus research through a series of lectures, workshops and hands-on sessions. It is intended both for researchers (junior and seasoned) who have recently embarked on a learner corpus project and for those who simply want to know more about this exciting field of research. No prior knowledge of learner corpus research is expected.
Teaching staff: Sylvie De Cock, Gaëtanelle Gilquin, Sylviane Granger, Marie-Aude Lefer, Magali Paquot and Jennifer Thewissen.
The following topics will be covered:
Learner corpus design/compilation
Learner corpus annotation (including error annotation)
Learner corpus analysis (including data coding, Contrastive Interlanguage Analysis, Integrated Contrastive Model, introduction to statistics for learner corpus research)
Interpreting learner corpus data
Applications of learner corpus research
Hands-on sessions will help participants familiarize themselves with a range of software tools for linguistic analysis, and with native and learner corpora, in particular the International Corpus of Learner English and the Louvain International Database of Spoken English Interlanguage.
Participants will have the opportunity to give a brief presentation of their project at the Learner Corpus Colloquium which will take place on the first day.
The Summer School will also feature one-to-one sessions with members of the CECL team to answer questions related to the design of individual projects.
Registration will open on 23 January 2023. Please note that the number of participants is limited to 30.
Information about the Summer School can be obtained from the following website: https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpus-researc…
Dear colleagues,
*Research in Corpus Linguistics* (RiCL, ISSN 2243-4712), the official
journal of the *Spanish Association for Corpus Linguistics* (AELINCO), is
seeking contributions to be published in the first issue of 2023 – June
(11/1).
Authors interested in publishing their work in this issue should submit
their contributions by February 15.
RiCL invites previously unpublished submissions in four main forms:
1. Papers reporting on research based on or derived from corpora.
2. Research papers reporting on corpus construction, annotation, and the
development and application of corpus tools, software, etc.
3. Book reviews in the field of Corpus Linguistics.
4. Review articles in the field of Corpus Linguistics.
Further information about RiCL’s editorial policies can be found here
<https://urldefense.com/v3/__https://ricl.aelinco.es/index.php/ricl/editoria…>.
Specific areas of interest include corpus design, compilation, and
typology; discourse, literary analysis and corpora; corpus-based
grammatical studies; corpus-based lexicology and lexicography; corpora,
contrastive studies and translation; corpus and linguistic variation;
corpus-based computational linguistics; corpora, language acquisition and
teaching; and special uses of corpus linguistics.
Please note that submissions must adhere to the submission guidelines
<https://urldefense.com/v3/__https://ricl.aelinco.es/index.php/ricl/about/su…>
of RiCL.
Further information at: https://ricl.aelinco.es
All the best,
Paula Rodríguez-Puente and Carlos Prado-Alonso
*Editors-in-chief*
--
Paula Rodríguez Puente
paula.r.puente(a)gmail.com
http://www.usc-vlcg.es/PRP.htm
In this newsletter:
LDC 2023 membership discounts now available
Approaching deadline for Spring 2023 data scholarship applications
30th Anniversary Highlight: AMR
________________________________
New publications:
CAMIO Transcription Languages<https://catalog.ldc.upenn.edu/LDC2022T07>
Global TIMIT Thai<https://catalog.ldc.upenn.edu/LDC2022S13>
Third DIHARD Challenge Evaluation<https://catalog.ldc.upenn.edu/LDC2022S14>
LDC 2023 membership discounts now available
Now through March 1, 2023, current 2022 members receive a 10% discount for renewing their membership, and new or returning organizations receive a 5% discount. Membership remains the most economical way to access current and past LDC releases. Consult Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for details on membership options and benefits.
Approaching deadline for Spring 2023 data scholarship applications
Attention students: don't miss out on the chance to receive no-cost access to LDC data for your research. Applications for Spring 2023 data scholarships are due January 15, 2023. For more information on requirements and program rules, see LDC Data Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
30th Anniversary Highlight: AMR
Abstract Meaning Representation (AMR) annotation was developed by LDC, SDL/Language Weaver, Inc., the University of Colorado's Computational Language and Educational Research group, and the Information Sciences Institute at the University of Southern California. It is a semantic representation language that captures "who is doing what to whom" in a sentence. Each sentence is paired with a rooted, directed graph that represents its whole-sentence meaning. AMR utilizes PropBank frames, non-core semantic roles, within-sentence coreference, named entity annotation, modality, negation, questions, quantities, and so on to represent the semantic structure of a sentence largely independent of its syntax.
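For illustration, a standard example from the AMR literature: the sentence "The boy wants to go" is written in PENMAN notation as a graph in which the reentrant variable b captures that the boy is both the wanter and the goer (frame sense numbers here are illustrative):

```
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
```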
LDC's Catalog contains three cumulative English AMR publications: Release 1.0 (LDC2014T12<https://catalog.ldc.upenn.edu/LDC2014T12>), Release 2.0 (LDC2017T10<https://catalog.ldc.upenn.edu/LDC2017T10>), and Release 3.0 (LDC2020T02<https://catalog.ldc.upenn.edu/LDC2020T02>). The combined result in AMR 3.0 is a semantic treebank of 59,255 English natural language sentences from broadcast conversations, newswire, weblogs, web discussion forums, fiction, and web text, and includes multi-sentence annotations.
LDC has also published Chinese Abstract Meaning Representation 1.0 (LDC2019T07<https://catalog.ldc.upenn.edu/LDC2019T07>) and 2.0 (LDC2021T13<https://catalog.ldc.upenn.edu/LDC2021T13>), developed by Brandeis University and Nanjing Normal University. These corpora contain AMR annotations for approximately 20,000 sentences from Chinese Treebank 8.0 (LDC2013T21<https://catalog.ldc.upenn.edu/LDC2013T21>). Chinese AMR follows the basic principles developed for English, making adaptations where necessary to accommodate Chinese phenomena.
Abstract Meaning Representation 2.0 - Four Translations (LDC2020T07<https://catalog.ldc.upenn.edu/LDC2020T07>), developed by the University of Edinburgh, School of Informatics, consists of Spanish, German, Italian, and Chinese Mandarin translations of a subset of sentences from AMR 2.0.
Visit LDC's Catalog <https://catalog.ldc.upenn.edu/> for more details about these publications.
________________________________
New publications:
CAMIO Transcription Languages<https://catalog.ldc.upenn.edu/LDC2022T07> was developed by LDC and contains nearly 70,000 images of machine printed text with corresponding annotations and transcripts in 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition and related technologies for 35 languages across 24 unique script types.
Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes; 1250 images per language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in an XML output format defined for this corpus. Data for each language is partitioned into test, train, or validation sets.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Global TIMIT Thai<https://catalog.ldc.upenn.edu/LDC2022S13> consists of 12 hours of read speech and time-aligned transcripts in Standard Thai from 50 speakers (33 female, 17 male) reading 120 sentences selected from the Thai National Corpus<https://www.arts.chula.ac.th/ling/tnc/>, the Thai Junior Encyclopedia<https://www.au.edu/royal-activities/the-thai-encyclopedia-for-youth-project…>, and Thai Wikipedia, for a total of 6000 utterances. Data was collected in 2016. Speakers were recruited in the Bangkok metropolitan area; they were native Thais, fluent in Standard Thai, and literate.
This data set was developed as part of LDC's Global TIMIT project which aims to create a series of corpora in a variety of languages with a similar set of key features as in the original TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1)<https://catalog.ldc.upenn.edu/LDC93S1> which was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
*
Third DIHARD Challenge Evaluation<https://catalog.ldc.upenn.edu/LDC2022S14> was developed by LDC and contains 33 hours of English and Chinese speech data along with corresponding annotations used in support of the Third DIHARD Challenge<https://dihardchallenge.github.io/dihard3>.
The DIHARD third development and evaluation sets were drawn from diverse sources including monologues, map task dialogues, broadcast interviews, sociolinguistic interviews, meeting speech, speech in restaurants, clinical recordings, and amateur web videos. Annotations include diarization and segmentation.
2022 members can access this corpus through their LDC accounts. Non-members may license this data for a fee.
To unsubscribe from this newsletter, log in to your LDC account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to "Receive Newsletter" under Account Options; or contact LDC for assistance.
Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc(a)ldc.upenn.edu<mailto:ldc@ldc.upenn.edu>
M: 3600 Market St. Suite 810
Philadelphia, PA 19104