September 2022 - Corpora

PhD position in NL4XAI project (Universitat Autònoma de Barcelona-UAB, Spain)
by bbkhse＠gmail.com 16 Sep '22

A Ph.D. position is offered within the framework of NL4XAI: Interactive Natural Language Technology for Explainable Artificial Intelligence, a project funded by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Grant Agreement No. 860621.

Ph.D. research topic: Argumentation-based Multi-agent Recommender System
Host institution: Agencia Estatal del Consejo Superior de Investigaciones Científicas - CSIC (Spain), Barcelona, Spain
Required degree: Degree in Computer Science (or similar), Master's degree in Artificial Intelligence, Computer Science or equivalent providing access to a Ph.D. program.
📆 Deadline for application: October 5, 2022 at 23h59 CET (UTC + 01:00)

All details are available at: https://nl4xai.eu/open_position/esr72022-argumentation-based-multi-agent-re…

NL4XAI is a European Training Network (ETN) project, which will train 11 creative, entrepreneurial, and innovative early-stage researchers (ESRs), who will face the challenge of making Artificial Intelligence (AI) self-explanatory and thus contribute to translating knowledge into products and services for economic and social benefit, with the support of Explainable AI (XAI) systems. The focus of NL4XAI is on the automatic generation of interactive explanations in natural language, just as humans naturally do, and as a complement to visualization tools. As a result, ESRs are expected to promote the use of AI models and techniques even by non-expert users.

The NL4XAI consortium is made up of 19 partners and beneficiaries from 6 different European countries (France, Malta, Poland, Spain, The Netherlands, and the United Kingdom). The consortium is coordinated by the Research Centre in Intelligent Technologies of the Univ. of Santiago de Compostela (CiTIUS-USC) and the partners correspond to 2 national R&D centers (IIIA-CSIC and CNRS-LORIA), 11 universities (Univ. Aberdeen, TU Delft, Univ. Malta, Utrecht Univ., Univ. Twente, Univ. Lorraine, Univ. Dundee, Univ. Autònoma de Barcelona, Univ. Santiago de Compostela, Warsaw Univ. of Technology and Maastricht University) and 6 private companies (Indra, Accenture, Orange, Wizenoze, Arria, and Info Support).

Each ESR will work on an individual research project in a different host institution and will participate in academic and inter-sectoral secondments at the premises of other NL4XAI members.

We are looking for outstanding, motivated, and team-spirited candidates to carry out a Ph.D. within the NL4XAI ETN, who will get unique international and inter-sectoral training from prominent European researchers (from both academia and industry). Young researchers willing to work on Explainable AI are encouraged to apply.

Please don't hesitate to contact me for further information.

Raluca Silvana Tomoni Meda
Project Manager
E-mail: ralucasilvana.tomoni(a)usc.es

Launch of a new journal JDIR
by Mai Zaki 16 Sep '22

Dear all,

*Apologies for cross posting*

It is my pleasure to announce the launch of a new journal by Brill, the Journal of Digital Islamicate Research, where I am privileged to be co-editor in chief. The journal aims to highlight works using computational, visualization and big data methods to explore various issues in the emerging field of Middle Eastern and Islamic Digital Humanities. The journal also aims to promote the study of Arabic-language and other Arabic-script DH work, in addition to Islamicate materials that are digital-born or digitally-reformatted.

Please feel free to circulate among your colleagues, and visit the website for more information. We hope to see some of your contributions soon.

https://brill.com/view/journals/jdir/jdir-overview.xml

On behalf of the whole editorial team,
Mai Zaki

*******************
*Mai Zaki, Ph.D. د. مي زكي*
Associate Professor of Linguistics أستاذ مشارك في اللسانيات
CAS | Department of Arabic and Translation Studies قسم دراسات اللغة العربية والترجمة
American University of Sharjah, UAE الجامعة الأمريكية في الشارقة، الإمارات العربية المتحدة
*http://www.aus.edu <http://www.aus.edu/>*
Follow me on Academia.edu <https://aus.academia.edu/DrMaiZaki>
Follow me on ResearchGate <https://www.researchgate.net/profile/Mai_Zaki2>

Mini survey on EXMARaLDA annotation tools
by Thomas Schmidt 15 Sep '22

Dear list members,

if you happen to be using or have used EXMARaLDA (https://exmaralda.org/en/) tools in your work, I would appreciate it if you could complete this little survey:

https://tinyurl.com/exmaralda-mini-survey

More on the background here: https://exmaralda.org/en/2022/09/10/exmaralda-mini-survey/

Thanks, best regards,
Thomas (Schmidt)

----------------
Dr. Thomas Schmidt
https://orcid.org/0000-0003-0026-6450

Multilingual dictionary of phonetic spelling
by Luis Camacho Caballero 15 Sep '22

Dear colleagues,

I'm devoted to the revitalization and massification of the Andean Amazonian native language, with computational processing as a key enabler. Among the many tasks to do, I am currently working on the creation of neologisms. That is why I'm looking for the largest multilingual dictionary of phonetic spelling, even better if that database includes Asian languages (Mandarin, Japanese, Korean, Hindi, Urdu, etc.).

If you have this kind of database, I kindly ask you to give me access; if you don't, I'd appreciate any clue about where and/or how to access one.

Kind regards,
Luis Camacho
<https://orcid.org/0000-0001-6569-550X>

Call for Shared Task Proposals for SIGMORPHON 2023
by SIG MORPHON 15 Sep '22

Attention,

the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON) is currently soliciting proposals for Shared Tasks for the 2023 workshop. If you're interested in organizing a shared task, or know someone who might be interested, please visit https://sigmorphon.github.io/workshops/2023/call_for_tasks/ for details.

Important dates:
Submission of proposal: September 30, 2022
Notification of acceptance: October 15, 2022
Data ready / task begins: January 31, 2023
Workshop: TBA, Summer 2023

Any inquiries can be sent to the official SIGMORPHON e-mail address: sigmorphon(a)gmail.com

Garrett Nicolai
SIGMORPHON President

September 2022 Newsletter - LDC
by Penn LDC 15 Sep '22

In this newsletter:

Upcoming Policy Change to LDC's Open Memberships
LDC at Interspeech 2022
LanguageARC: Citizen Science for Language
30th Anniversary Highlight: Switchboard

New publications:
Xi'an Guanzhong Object Naming<https://catalog.ldc.upenn.edu/LDC2022S09>
MASRI Synthetic<https://catalog.ldc.upenn.edu/LDC2022S08>
________________________________

Upcoming Policy Change to LDC's Open Memberships

LDC is changing its open membership year policy beginning January 1, 2023. Only one membership year will be open for joining - the current membership year. The 2022 membership year will close for joining on December 31, 2022. We expect this change to have a minimal impact on members, while allowing us to streamline our processes to serve members better. LDC's many membership benefits<https://www.ldc.upenn.edu/members/benefits> will remain the same and organizations choosing to join membership years in advance will still be able to do so. If you have any questions about this change, please don't hesitate to contact our membership office<mailto:ldc@ldc.upenn.edu>.

LDC at Interspeech 2022

LDC is proud to sponsor the Workshop for Young Female Researchers in Speech<https://sites.google.com/view/yfrsw-2022/> (YFRSW) to be held in person as an Interspeech 2022<https://interspeech2022.org/> pre-conference satellite event on September 17. Also, be sure to check out the collaborative work of LDC's Mark Liberman, "The mapping between syntactic and prosodic phrasing in English and Mandarin", presented during the On-Site Oral Session: Phonetics and Phonology on Wednesday, September 21, 13:30-15:30 KST.

LanguageARC: Citizen Science for Language

LanguageARC<https://languagearc.com> is a citizen science web portal for language research developed by LDC with the support of the National Science Foundation (grant #1730377).
LanguageARC brings together researchers and participants from the general public interested in language to form a community dedicated to supporting and advancing language-related research and development. Contributors to this online community can participate in a variety of language-related tasks and activities such as reading text, answering questions, describing images or video, creating or evaluating transcriptions for audio clips, or developing translations into their native languages. LanguageARC includes projects in languages other than English, such as French, Sesotho, and Swedish. Xi'an Guanzhong Object Naming LDC2022S09<https://catalog.ldc.upenn.edu/LDC2022S09>, released this month in LDC's Catalog and described below, is an example of a data set developed using LanguageARC. New projects will be added on an ongoing basis.

Sign up for a LanguageARC account today to start making real contributions to language knowledge and research. Please share this information with colleagues, students, and anyone who might be interested in participating in the language activities on this website. If you are a researcher interested in creating a project on LanguageARC, please reach out on the site's Contact<https://languagearc.com/messages/new> page. Find LanguageARC on Facebook at: https://www.facebook.com/languagearc

30th Anniversary Highlight: Switchboard

Switchboard-1 Release 2 (LDC97S62<https://catalog.ldc.upenn.edu/LDC97S62>) is considered the first large collection of spontaneous conversational telephone speech (Graff & Bird, 2000<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2000-many-uses-…>). It consists of approximately 260 hours of recordings collected by Texas Instruments in 1990-1991 (Godfrey et al., 1992<https://isip.piconepress.com/projects/switchboard/doc/education/papers/pape…>). The first release of the corpus (later superseded) was published by NIST and distributed by LDC in 1993.
Participants were 543 speakers (302 male, 241 female) from across the United States who accounted for around 2,400 two-sided telephone conversations. A robot operator handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion, and recording the speech from the two subjects into separate channels until the conversation was finished. Roughly 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.

This gold standard data set has been used for many HLT applications, including speaker identification, speaker authentication, and speech recognition. It is considered one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (Deshmukh et al., 1998<https://www.researchgate.net/profile/Aravind-Ganapathiraju/publication/2214…>) as well as a key resource for studying the phonetic properties of spontaneous speech (Greenberg et al., 1996<https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.3655&rep=rep1&…>). Annotation tasks based on Switchboard include discourse tags/speech acts, part-of-speech tagging and parsing, and sentiment analysis<https://catalog.ldc.upenn.edu/LDC2020T14>.

The Switchboard series includes Switchboard Credit Card<https://catalog.ldc.upenn.edu/LDC93S8>, Phase II<https://catalog.ldc.upenn.edu/LDC98S75>, Phase III<https://catalog.ldc.upenn.edu/LDC2002S06>, the Switchboard Cellular<https://catalog.ldc.upenn.edu/LDC2001S13> collection, and new recordings from 18 Switchboard participants in the 2013 Greybeard<https://catalog.ldc.upenn.edu/LDC2013S05> corpus. All Switchboard corpora are available in the Catalog for licensing by Consortium members and non-members.
Visit Obtaining Data<https://www.ldc.upenn.edu/language-resources/data/obtaining> for more information.
________________________________

New publications:

Xi'an Guanzhong Object Naming<https://catalog.ldc.upenn.edu/LDC2022S09> is comprised of 15 hours of audio recordings from speakers of the Guanzhong dialect of Mandarin Chinese living in or near Xi'an in Shaanxi Province (China) naming objects that appeared in colored line drawings. The corpus was developed to support traditional and computer-aided language documentation. The collection was conducted from February to May 2021 using LanguageARC<https://languagearc.com/>, a citizen science portal developed by LDC, from a closed volunteer community. Speakers were presented with images selected from the MultiPic dataset<https://www.bcbl.eu/databases/multipic> and were asked to record themselves naming the objects in the images.

Xi'an Guanzhong Object Naming is distributed via web download. 2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

MASRI Synthetic<https://catalog.ldc.upenn.edu/LDC2022S08>

MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team <https://www.um.edu.mt/projects/masri/> at the University of Malta<https://www.um.edu.mt/> and contains 99 hours of synthesized Maltese speech. Source sentences were extracted from the Maltese Language Resource Server<https://mlrs.research.um.edu.mt/index.php?page=corpora> (MLRS) corpus, comprised of written or transcribed Maltese covering various genres, including parliamentary debates, news, law, opinion, sports, culture, academic, literature, and religious texts. Text was processed through the CrimsonWing text-to-speech system to generate speech files. Synthesized speech was created with 210 voices (105 female, 105 male).

MASRI Synthetic is distributed via web download.
2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc(a)ldc.upenn.edu<mailto:ldc@ldc.upenn.edu>
M: 3600 Market St. Suite 810 Philadelphia, PA 19104

Second Call for Papers Global WordNet Conference 2023
by Begoña Altuna 15 Sep '22

Second Call for Papers Global WordNet Conference 2023

[Apologies for cross posting]

Call for Papers
12th International Global Wordnet Conference
Donostia / San Sebastian, Basque Country
January 23-27, 2023

Global Wordnet Association: www.globalwordnet.org
Conference website: https://hitz.eus/gwc2023

The Global Wordnet Association is pleased to announce the 12th International Global Wordnet Conference (GWC2023) in Donostia / San Sebastian (Spain) hosted by HiTZ, Basque Center for Language Technology at the University of the Basque Country.

NOTE: COVID-19 allowing, the conference will be in person only.

Organisers: Begoña Altuna, Itziar Aldabe, Xabier Arregi, Itziar Gonzalez-Dios, Aritz Farwell and Esther Miranda.

Details about the Association and the full announcement for the conference can be found on the conference website: https://hitz.eus/gwc2023

We invite submissions with original contributions addressing, but not limited to, the topics listed below. Proposals for tutorials are welcome as well.

Conference Topics

1. Lexical semantics and meaning representation
   * Critical analysis and applications of lexical and semantic relations
   * Proposed new relations
   * Definitions, semantic components, co-occurrence and frequency statistics
   * Word, Sense and Context Embeddings
   * Necessity and completeness issues
   * Ontology and wordnet
   * Other lexicographical and lexicological questions pertaining to wordnet-style meaning representation
   * Wordnets and other modalities
2. Architecture of lexical databases
   * Language independent and language dependent components
   * Integration of multi-wordnets in research infrastructures (like CLARIN, ELG, etc.)
   * Wordnets and Linked Open Data (LOD)
3. Tools and methods for wordnet development
   * User and Data entry interfaces
   * Methods for constructing, extending and enriching wordnets
   * Methods for linking wordnets to other lexical and semantic resources
   * Methods for leveraging existing wordnets and semantic networks with large language models
4. Applications of wordnet
   * Word sense disambiguation
   * Text generation
   * Commonsense reasoning
   * Machine translation
   * Information extraction and retrieval
   * Document structuring and categorisation
   * Automatic hyperlinking
   * Language pedagogy
   * Psycholinguistic applications
   * Embeddings and pretrained language models
   * Probing large neural language models
5. Standardization, distribution and availability of wordnets and wordnet tools.

Submissions will fall into one of the following categories (page limits exclude references):
* long papers: 8 pages max, 30 minute presentation
* short papers: 5 pages max, 15 minute presentation
* project reports: 5 pages max, 10 minute presentation
* demonstrations: 5 pages max, with an additional 3 pages screen dumps or images; 20 minute presentation

Submissions should be anonymous and any identifying information must be removed. Authors must state the preferred category, though acceptance may be subject to change in the category of the presentation, e.g. a long paper submission may be accepted as a short paper.

Final papers should be submitted in electronic form (PDF only). Paper submissions must use the official ACL style templates, which are available from here <https://github.com/acl-org/acl-style-files> (Latex and Word). Please follow the paper formatting guidelines general to “*ACL” conferences available here <https://acl-org.github.io/ACLPUB/formatting.html>. Authors may not modify these style files or use templates designed for other conferences.

Submission site: https://easychair.org/conferences/?conf=gwc2023

Important Dates (NEW DATES!!!)

1. September 30, 2022 Deadline for abstract submission
2. October 14, 2022 Deadline for paper submission
3. November 18, 2022 Notification of acceptance
4. December 1, 2022 Registration opens
5. December 23, 2022 Deadline author registration, final version paper
6. January 23-27, 2023 Conference

Proceedings

Conference proceedings will be open access and downloadable from the GWA website. The proceedings will have an ISBN and be published in the ACL anthology. Papers are only included in the proceedings if at least one author has registered. Inclusion of accepted submissions into the final program and the proceedings is contingent upon at least one author’s registration. Late registration and on-site registration for participants is possible without inclusion of the paper and without presentation.

Conference Chairs

German Rigau - german.rigau(a)ehu.eus
Francis Bond - bond(a)ieee.org

Local Organizing Chairs

Begoña Altuna - begona.altuna(a)ehu.eus
Itziar Aldabe - itziar.aldabe(a)ehu.eus
Xabier Arregi - xabier.arregi(a)ehu.eus
Itziar Gonzalez-Dios - itziar.gonzalezd(a)ehu.eus
Aritz Farwell - asfarwell(a)ehu.eus
Esther Miranda - esther.miranda(a)ehu.eus

Program Committee (to be confirmed and extended)

Adam Pease, Articulate Software
Ales Horak, Masaryk University
Alexandre Rademaker, IBM Research Brazil and EMAp/FGV
Bolette Pedersen, University of Copenhagen
Christiane Fellbaum, Princeton University
Darja Fiser, University of Ljubljana
David Lindemann, IWiSt, University of Hildesheim
Diptesh Kanojia, IIT Bombay
Eneko Agirre, University of the Basque Country
Ewa Rudnicka, Wrocław University of Technology
Francis Bond, Palacký University
Gerard De Melo, Rutgers University
German Rigau, IXA Group, UPV/EHU
Haldur Oim, University of Tartu
Heili Orav, University of Tartu
Hugo Gonçalo-Oliveira, Department of Informatics Engineering of the University of Coimbra
Janos Csirik, University of Szeged
John Mccrae, National University of Ireland, Galway
Kadri Vider, University of Tartu
Kevin Scannell, Saint Louis University
Kyoko Kanzaki, Otemon Gakuin University
Maciej Piasecki, Department of Computational Intelligence, Wroclaw University
Marten Postma, Vrije Universiteit Amsterdam
Paul Buitelaar, National University of Ireland, Galway
Piek Vossen, VU University Amsterdam
Sanni Nimb, The Danish Society for Language and Literature
Shan Wang, The Education University of Hong Kong
Shu-Kai Hsieh, National Taiwan Normal University
Sonja Bosch, Department of African Languages, University of South Africa
Thierry Declerck, DFKI, Saarbruecken
Tim Baldwin, The University of Melbourne
Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute
Umamaheswari Vasanthakumar, Nanyang Technological University
Valeria Depaiva, Natural Language and AI Research Laboratory of Nuance Communications, Inc.
Verginica Mititelu, Romanian Academy Research Institute for Artificial Intelligence

Sponsors

Keler https://www.keler.eus/en

CASE 2022, shared tasks on event extraction
by Hristo Tanev 15 Sep '22

Dear NLP Researchers,

For the 5th consecutive year, the Emerging Market Welfare Project and the European Commission Joint Research Centre would like to invite you to test your event extraction systems at the three shared tasks in the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2022).

The shared tasks feature detection of politically-motivated conflict events, especially detection of protests and riots. You can use the occasion to test your system on a real-life text collection.

Please follow the link to the workshop and register if you want to participate: https://lnkd.in/dbGp2jRe

The organizers of CASE 2022

Conference: Literary Machine Translation as a Human-Machine Dialectic (ULiège)
by Damien Hansen 14 Sep '22

[Apologies for cross-posting]

*Literary Machine Translation as a Human-Machine Dialectic*
International conference
Thursday, 6 October 2022
University of Liège, Belgium

Program and registration: https://www.cirti.uliege.be/litMT2022

The Centre interdisciplinaire de recherches en traduction et en interprétation (CIRTI, University of Liège) will hold a one-day symposium centred around the topic of literary machine translation and dialogue between human and machine as a potential computer-assisted literary creation tool.

• Can we re-imagine and use machine translation tools for creative means?
• How do we train systems adapted to the literary domain?
• What impact would it have on creativity, quality and translators' voice?
• What are the ethical challenges brought about by new technologies?

Join us for a day of exchanges, presentations and round tables as we tackle these questions.

All best,
The Organizing Team

---
*Damien Hansen*
Université de Liège - Université Grenoble Alpes
CIRTI - LIG/GETALP - LGL

multi-modal research opening in Cambridge, UK
by Svetlana Stoyanchev 14 Sep '22

The Speech Technology Group of Toshiba Europe LTD in Cambridge has an opening for a researcher to work on multi-modal interfaces. The position offers researchers the opportunity to work with an interdisciplinary team focussing on both speech and vision modalities. We are looking for candidates with a PhD or Masters and deep learning experience who will contribute to advancing multi-modal research and building prototype systems.

Please check here for more details: https://careers.toshiba.eu/displayjob.aspx?jobid=351

-----------------------
Dr Svetlana Stoyanchev
Speech Technology Group, Cambridge Research Lab, Toshiba Europe Limited
https://www.linkedin.com/in/svetlana-stoyanchev/
