May 2023 - Corpora - ELRA lists

[First CfP] Taming Large Language Models: Controllability in the era of Interactive Assistants -- Workshop at INLG 2023
by Devamanyu Hazarika 22 May '23

First Call For Submissions
Welcome to the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! This workshop aims to unite esteemed scholars, researchers, and practitioners specializing in Natural Language Generation (NLG). This event will foster in-depth discussions and explorations of the challenges and prospects associated with content control in LLMs. Emphasizing the intersection of NLG research and the instruction-learning paradigm, the workshop will serve as a platform for fruitful collaborations and knowledge exchange. This hybrid workshop will be co-located with INLG 2023 (https://inlg2023.github.io/workshops.html) in Prague.
------------------------------
Important Dates
- Submission deadline: June 15, 2023
- Author notification: July 21, 2023
- Camera-ready deadline: August 14, 2023
- Workshop date: September 12, 2023
Submission Portal: https://softconf.com/n/tllm2023
Website: https://ctrlnlg.github.io/
All deadlines are 11.59 pm AoE.
------------------------------
Topics
We welcome submissions on one or more of the following topics:
- Alignment: Investigating techniques to better align LLMs with human values and intentions, including reward modeling, human-in-the-loop systems, and quantifying alignment metrics. Understanding the objectives pursued by a model and aligning them with human preferences are key challenges. We encourage research on methods to improve alignment, such as through prompt design and fine-tuning.
- In-context Learning: Exploring the role of context in LLMs, including how to improve context understanding, manage context drift, and enhance context-aware responses. Also, investigating the use of in-context learning as a control mechanism.
- Instruction-based Control: Comparing popular controlling mechanisms, including approaches such as logit manipulation, decoder mixing, and classifier guidance, amongst others, against the simpler instruction-based control.
- Generality: Investigating controllable techniques that work across tasks and datasets.
- Safety and Robustness: Assessing potential risks and vulnerabilities in LLMs, along with solutions such as adversarial training, safe exploration, and monitoring model behavior during deployment.
- Controllability vs. Robustness: Developing methods to better understand LLMs' decision-making processes and how they act in grounded scenarios. Understanding their reliance on implicit vs. explicit memory.
- Scalability and efficiency: Investigating novel approaches for reducing computational requirements for achieving control in LLMs.
- Real-world applications and case studies: Showcasing successful LLM deployments in various fields, such as healthcare, finance, education, and creative industries, along with lessons learned and future opportunities.
Submissions
We welcome reports of original research in the form of two types:
- Long papers (8 pages + references)
- Short papers (4 pages + references)
We encourage all authors to include relevant discussions of ethical considerations and impact in the body of the paper. Submissions will be made via SoftConf/START: https://softconf.com/n/tllm2023 <https://softconf.com/n/icard2023/>
Submission Format
- The proceedings will be published by ACL Anthology.
- All long, short, and abstract submissions must follow the two-column ACL format, which is available as an Overleaf template <https://www.overleaf.com/read/crtcwgxzjskr> and also downloadable directly <https://github.com/acl-org/acl-style-files> (LaTeX and Word). Please refer to the SIGDIAL 2023 website for the most recent version of the templates.
- Submissions must conform to the official ACL style guidelines, which are contained in these templates. Submissions must be electronic, in PDF format.
- All submissions should be anonymized to facilitate double-blind reviewing.
- Submissions that do not adhere to the author guidelines or ACL policies <https://www.aclweb.org/adminwiki/index.php?title=ACL_Author_Guidelines> will be rejected without review.
- The appendix should be added in the main document after the references. The appendix does not count towards the page length.
Any questions regarding submissions can be sent to tamingllm-workshop(a)googlegroups.com

IACT’23@SIGIR: Human or AI? Last CFP
by Hugo Oliveira Sousa 22 May '23

*** Apologies for cross-posting *** ++ CFP: LAST CHANCE TO SUBMIT ++ ============================================================================================================================= The 1st International Workshop on Implicit Author Characterization from Texts for Search and Retrieval (IACT’23) The workshop will be held in conjunction with the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. Workshop website: https://en.sce.ac.il/news/iact23 Date: July 27, 2023 Location: Taipei, Taiwan. Paper submission deadline: Extended to May 23, 2023, AoE Submission link: https://easychair.org/conferences/?conf=iact23 To bring the research community's attention to the limitations of current models in recognizing and characterizing AI vs. human authors, we organize the first edition of IACT workshops under the umbrella of the SIGIR conference. Research works submitted to the workshop should foster scientific advances in all aspects of author characterization. Submission Guidelines: All papers must be original and not simultaneously submitted to another journal or conference. The following paper categories are welcome: - Full research papers: up to 8 pages. Original and high-quality unpublished contributions to the theory and practical aspects of the workshop topics. - Short research papers: up to 5 pages. It can describe ongoing research, resources, and demos. - Negative results papers: up to 5 pages. Highlighting tested hypotheses that did not get the expected outcome is also welcomed. - Position papers: up to 5 pages. Discussing current and future research directions. The length constraints do not include references. The submissions must be anonymous and will be peer-reviewed by at least two program committee members. Workshop Format: The authors of accepted papers will be given 15 minutes for a short oral presentation. The workshop will run as a hybrid event to allow virtual attendance and meet the SIGIR format. 
Workshop Topics: Research works submitted to the workshop should foster the scientific advance on all aspects of implicit author information extraction from text, including but not limited to the following: - Differentiation between AI-generated content and human-generated content and bot profiling - Characterization of conversational agents - Feature detection of authors for human vs. AI determination - Prompt understanding and recognition in language models - Personalized question-answering and conversation generation - Troll identification on social media - Review authenticity estimation - Multi-modal, multi-genre, and multilingual author analysis - Character analysis, description, and representation in narrative texts - Detecting implicit expressions of sentiment, emotion, opinion, and bias - Transfer learning for implicit author characterization - Implicit author characterization annotation schema - Evaluation of implicit author characterization - Author characterization in low-resource languages and under-studied domains - Accountability and regulation of AI-based information extraction, retrieval, and content generation - Copyright issues of AI-generated content - Ethical and privacy implications of author characterization and implicit information extraction - Fairness and bias of AI-generated content Organizing Committee: Marina Litvak - marinal(a)ac.sce.ac.il; Shamoon College of Engineering Beer Sheva; Israel Irina Rabaev - irinar(a)ac.sce.ac.il; Shamoon College of Engineering Beer Sheva; Israel Alípio Mário Jorge - amjorge(a)fc.up.pt; University of Porto; Porto, Portugal Ricardo Campos - ricardo.campos(a)ipt.pt; Polytechnic Institute of Tomar INESC TEC, Portugal; Porto, Portugal Adam Jatowt - adam.jatowt(a)uibk.ac.at; University of Innsbruck; Innsbruck, Austria Invited Speakers: Prof. Mark Last - Ben-Gurion University of the Negev, Israel Prof. Dr. Valia Kordoni - Humboldt-Universität Berlin, Germany Contacts: Dr. 
Marina Litvak: litvak.marina(a)gmail.com Dr. Irina Rabaev: irinar(a)ac.sce.ac.il All the best, Hugo Sousa on behalf of the IACT'23 Organizers

Early Registration Extension 28th May - IWCS 2023
by Maxime Amblard 22 May '23

**apologies for cross-postings**
===== Call for Participation - IWCS 2023 =====
Early Registration Extension: -> 28th May 2023
https://iwcs2023.loria.fr/registration/
============================================
15th International Conference on Computational Semantics (IWCS)
Université de Lorraine, Nancy, France
20-23 June 2023
http://iwcs2023.loria.fr/
IWCS is the biennial meeting of SIGSEM [1], the ACL special interest group on semantics [2]; this year's edition is organized in person by the Loria [3] and IDMC [4] of the Université de Lorraine.
[1] http://sigsem.org/
[2] http://aclweb.org/
[3] https://www.loria.fr/fr/
[4] http://idmc.univ-lorraine.fr/
The aim of the IWCS conference is to bring together researchers interested in any aspects of the computation, annotation, extraction, representation and neuralisation of meaning in natural language, whether this is from a lexical or structural semantic perspective. IWCS embraces both symbolic and machine learning approaches to computational semantics, and everything in between. The conference and workshops will take place 20-23 June 2023.
=== TOPICS OF INTEREST ===
We invite paper submissions in all areas of computational semantics, in other words all computational aspects of meaning of natural language within written, spoken, signed, or multi-modal communication. Presentations will be oral and posters.
Submissions are invited on these closely related areas, including the following: * design of meaning representations * syntax-semantics interface * representing and resolving semantic ambiguity * shallow and deep semantic processing and reasoning * hybrid symbolic and statistical approaches to semantics * distributional semantics * alternative approaches to compositional semantics * inference methods for computational semantics * recognising textual entailment * learning by reading * methodologies and practices for semantic annotation * machine learning of semantic structures * probabilistic computational semantics * neural semantic parsing * computational aspects of lexical semantics * semantics and ontologies * semantic web and natural language processing * semantic aspects of language generation * generating from meaning representations * semantic relations in discourse and dialogue * semantics and pragmatics of dialogue acts * multimodal and grounded approaches to computing meaning * semantics-pragmatics interface * applications of computational semantics === SUBMISSION INFORMATION === Two types of submission are solicited: long papers and short papers. Both types should be submitted not later than 3 March (anywhere on earth). Long papers should describe original research and must not exceed 8 pages (not counting acknowledgements and references). Short papers (typically system or project descriptions, or ongoing research) must not exceed 4 pages (not counting acknowledgements and references). Both types will be published in the conference proceedings and in the ACL Anthology. Accepted papers get an extra page in the camera-ready version. Style-files: IWCS papers should be formatted following the common two-column structure as used by ACL. Please use our specific style-files or the Overleaf template, taken from ACL 2021. Similar to ACL 2021, initial submissions should be fully anonymous to ensure double-blind reviewing. 
Submitting: Papers should be submitted in PDF format via Softconf: https://softconf.com/iwcs2023/papers Please make sure that you select the right track when submitting your paper. Contact the organisers if you have problems using Softconf.
No anonymity period: IWCS 2023 does not have an anonymity period. However, we ask you to be reasonable and not publicly advertise your preprint during (or right before) review.
=== IMPORTANT DATES ===
22 March 2023 (anywhere on earth) Paper submissions
17 April 2023 Decisions sent to authors
20-23 June 2023 IWCS conference
=== CONTACT ===
For questions, contact: iwcs2023-contact(a)univ-lorraine.fr
Maxime Amblard, Ellen Breitholtz (the IWCS 2023 organizers)

TOC: Journal of Language Modelling 10(2)
by Adam Przepiórkowski 22 May '23

It is our pleasure to announce the publication of issue 10(2) of the Journal of Language Modelling (JLM), a free open-access peer-reviewed journal aiming to bridge the gap between theoretical, formal and computational linguistics: http://jlm.ipipan.waw.pl/ (see “CURRENT” or “ALL ISSUES”). The direct persistent link to this issue is: http://jlm.ipipan.waw.pl/index.php/JLM/issue/view/28. JLM is indexed by SCOPUS, ERIH PLUS, DBLP, DOAJ, etc., and it is a member of OASPA. TABLE OF CONTENTS: Articles: “Idiosyncratic frequency as a measure of derivation vs. inflection” Maria Copot, Timothee Mickus, Olivier Bonami 193–240 “Simplicity and learning to distinguish arguments from modifiers” Leon Bergen, Edward Gibson, Timothy J. O'Donnell 241–286 “Neural heuristics for scaling constructional language processing” Paul Van Eecke, Jens Nevens, Katrien Beuls 287–314 Acknowledgments: “External Reviewers 2019–2022” 315–318 The current make-up of the JLM Editorial Board is enclosed below. Best regards, Adam Przepiórkowski (JLM Editor-in-Chief) ====================================================================== EDITORIAL BOARD: Steven Abney, University of Michigan, USA Ash Asudeh, University of Rochester, USA Chris Biemann, Universität Hamburg, GERMANY Igor Boguslavsky, Technical University of Madrid, SPAIN; Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, RUSSIA António Branco, University of Lisbon, PORTUGAL David Chiang, University of Southern California, Los Angeles, USA Greville Corbett, University of Surrey, UNITED KINGDOM Dan Cristea, University of Iași, ROMANIA Jan Daciuk, Gdańsk University of Technology, POLAND Mary Dalrymple, University of Oxford, UNITED KINGDOM Darja Fišer, University of Ljubljana, SLOVENIA Anette Frank, Universität Heidelberg, GERMANY Claire Gardent, CNRS/LORIA, Nancy, FRANCE Jonathan Ginzburg, Université Paris-Diderot, FRANCE Stefan Th. 
Gries, University of California, Santa Barbara, USA Heiki-Jaan Kaalep, University of Tartu, ESTONIA Laura Kallmeyer, Heinrich-Heine-Universität Düsseldorf, GERMANY Jong-Bok Kim, Kyung Hee University, Seoul, KOREA Kimmo Koskenniemi, University of Helsinki, FINLAND Jonas Kuhn, Universität Stuttgart, GERMANY Alessandro Lenci, University of Pisa, ITALY Ján Mačutek, Comenius University in Bratislava, SLOVAKIA Igor Mel’čuk, University of Montreal, CANADA Glyn Morrill, Technical University of Catalonia, Barcelona, SPAIN Stefan Müller, Humboldt Universität zu Berlin, GERMANY Mark-Jan Nederhof, University of St Andrews, UNITED KINGDOM Petya Osenova, Sofia University, BULGARIA David Pesetsky, Massachusetts Institute of Technology, USA Maciej Piasecki, Wrocław University of Technology, POLAND Christopher Potts, Stanford University, USA Louisa Sadler, University of Essex, UNITED KINGDOM Agata Savary, Université François Rabelais Tours, FRANCE Sabine Schulte im Walde, Universität Stuttgart, GERMANY Stuart M. Shieber, Harvard University, USA Mark Steedman, University of Edinburgh, UNITED KINGDOM Stan Szpakowicz, School of Electrical Engineering and Computer Science, University of Ottawa, CANADA Shravan Vasishth, Universität Potsdam, GERMANY Zygmunt Vetulani, Adam Mickiewicz University, Poznań, POLAND Aline Villavicencio, Federal University of Rio Grande do Sul, Porto Alegre, BRAZIL Veronika Vincze, University of Szeged, HUNGARY Yorick Wilks†, Florida Institute of Human and Machine Cognition, USA Shuly Wintner, University of Haifa, ISRAEL Zdeněk Žabokrtský, Charles University in Prague, CZECH REPUBLIC ====================================================================== Adam Przepiórkowski ˈadam ˌpʃɛpjurˈkɔfskʲi http://clip.ipipan.waw.pl/ ____ Computational Linguistics in Poland http://jlm.ipipan.waw.pl/ ___________ Journal of Language Modelling http://zil.ipipan.waw.pl/ ____________ Linguistic Engineering Group http://nkjp.pl/ _________________________ National Corpus of Polish

2 researcher positions at the University of Cambridge, NLP / ML for education technology
by Andrew Caines 22 May '23

The Cambridge Institute for Automated Language Teaching and Assessment (ALTA) is seeking two Research Assistants or Research Associates in Natural Language Processing and Machine Learning to join their strong team of researchers in the Department of Computer Science and Technology. ALTA is a virtual institute which brings together researchers from Computer Science, Engineering, Linguistics and Language Assessment to investigate new ways of using technology to enhance language learning and to develop cutting-edge approaches to assessment which will benefit learners and teachers worldwide. The successful applicant will be working in EdTech based on LLM technology and will focus on at least one of the following areas: automated assessment of language learners, explainable models of assessment, learning-content generation, or adaptive learning. In all cases, the candidate must have a directly relevant PhD (or must be close to completion). The candidate is expected to have knowledge and experience of computational techniques relevant to natural language processing and machine learning, including an understanding and experience with pre-trained language models. The candidate will need to be confident communicating in cross-disciplinary forums. Further information: https://www.jobs.cam.ac.uk/job/40995/

extended deadline: PhD and Postdoc position in Cognitive Modelling
by Vera Demberg 22 May '23

We are inviting applications for one PhD position (3 years) and a postdoctoral position (funding available 09/2023-01/2026) for a computer scientist, computational linguist or psycholinguist who has experience with or interest in cognitive modelling for language processing (e.g., Bayesian models, and/or models using cognitive architectures like ACT-R). The positions will be funded as part of the ERC Starting Grant "Individualized Interaction in Discourse" of Prof. Vera Demberg, at Saarland University. The goal of the position is to develop models that capture individual differences in discourse and pragmatic processing. The candidate will conduct research on the design and implementation of cognitive models of language processing at the level of discourse and/or pragmatic processing. These models should capture individual differences in cognition such as working memory, language experience, background knowledge, theory-of-mind abilities etc. The successful applicant must have excellent spoken and written proficiency in English, and have a background in natural language processing or cognitive modelling.
Applicants are requested to submit their application, including a cover letter that specifies why you would like to work on this topic and what qualifies you for it, an academic CV, a list of academic publications, your BSc / MSc / PhD thesis (or a current draft), copies of academic degree certificates and names of two potential references. For application to the Postdoctoral position, please quote opening number W2290, for the PhD position please quote opening number W2289. The applications should be sent via email directly to Prof. Vera Demberg: vera(at)coli.uni-saarland.de
The (extended) application deadline is June 4th, 2023.
Saarland University is one of the leading centres for computer science and computational linguistics in Europe, and offers a dynamic and stimulating research environment. The group is affiliated with both the Department of Computer Science and with the Department of Language Science and Technology. Both departments are part of the Saarland Informatics Campus, which brings together 800 researchers and 2000 students from 81 countries. We collaborate closely with the university's Department of Computer Science, the Max Planck Institute for Informatics, the Max Planck Institute for Software Systems, and the German Research Center for Artificial Intelligence (DFKI). Our researchers and students come from all over the world, and our primary working language is English.
The Saarland University is an equal opportunities employer. In accordance with its policy of increasing the proportion of women in this type of employment, the University actively encourages applications from women. Women are given preference in cases of equal suitability, ability and professional performance. Applications from severely disabled persons will be given preferential consideration in the event of equal suitability. Part-time employment is generally possible. We welcome applications regardless of gender, nationality, ethnic and social origin, religion/belief, disability, age, and sexual orientation and identity. Pay grade classification is based on the particular details of the position held and the extent to which the applicant meets the requirements of the pay grade within the TV-L salary scale. Unfortunately, costs for attending an interview at Saarland University cannot be reimbursed in principle.
When you submit a job application to Saarland University you will be transmitting personal data. Please refer to our privacy notice for information on how we collect and process personal data in accordance with Art. 13 of the General Data Protection Regulation (Datenschutz-Grundverordnung). By submitting your application you confirm that you have taken note of the information in the Saarland University privacy notice.
-- Prof. Dr. Vera Demberg Computer Science and Computational Linguistics Saarland Informatics Campus Saarland University Campus C7.2 Room 3.02 D-66123 Saarbrücken Phone: +49-681-302-70024 Secretary's office: +49-681-302-70025 Fax: +49-681-302-70026

[Computers] Invitation to Publish a Paper on "When Natural Language Processing Meets Machine Learning— Opportunities, Challenges and Solutions"
by Mr. Blink Yu 22 May '23

***** apologies for multiple posting **** Dear Colleagues, We are running a special issue on "When Natural Language Processing Meets Machine Learning— Opportunities, Challenges and Solutions" in journal Computers (ISSN 2073-431X, https://www.mdpi.com/journal/computers ). Dr. Lu Bai from Ulster University, Belfast BT15 1ED, UK, Prof. Dr. Huiru Zheng from Ulster University, Belfast BT15 1ED, UK, and Dr. Zhibao Wang from Northeast Petroleum University, Daqing 163318, China are editing this special issue. We are writing to invite you to contribute a paper to be published in *open access* form for this Special Issue. The combination of Natural Language Processing (NLP) and Machine Learning (ML) has led to many advancements in the field of artificial intelligence, enabling computers to understand and analyse human language. NLP focuses on the interactions between human language and computers, while ML provides algorithms and techniques to make predictions and automate tasks based on data. The opportunities presented by this combination include improved text classification, sentiment analysis, machine translation, and question-answering systems. However, the integration of NLP and ML still faces several challenges, such as the need for large amounts of annotated data for training, handling the complexity and variability of human language, and ensuring the ethical and fair use of AI systems. To overcome these challenges, NLP and ML researchers are exploring innovative solutions such as transfer learning, semi-supervised learning, and unsupervised learning methods, as well as developing techniques to handle unstructured and diverse data. Additionally, there is a growing emphasis on ensuring the accountability, transparency, and ethical use of AI systems. For more details, please visit the special issue website: https://www.mdpi.com/journal/computers/special_issues/549S2G1BUP The manuscript submission deadline is *31 December 2023*. 
You may send your manuscript at any time up until the deadline. Papers will be reviewed upon receipt and published on an ongoing basis. Extended conference papers are also welcome. They should contain at least 50% of new material, e.g., in the form of technical extensions, more in-depth evaluations, or additional use cases. Computers (http://www.mdpi.com/journal/computers) is a fully open access journal of computer science, published quarterly online by MDPI, Switzerland. It is covered by Scopus (Elsevier, 2018 CiteScore: 3.7), ESCI (Web of Science), INSPEC and DBLP. Manuscripts are peer-reviewed and a first decision is provided to authors approximately *16.5* days after submission; acceptance to publication is undertaken in *3.8* days. For further details on the submission process, please see the instructions for authors at the journal website https://www.mdpi.com/journal/computers/instructions. In the hope that this invitation receives your favorable consideration, we look forward to our future collaboration.
Dr. Lu Bai from Ulster University, Belfast BT15 1ED, UK
Prof. Dr. Huiru Zheng from Ulster University, Belfast BT15 1ED, UK
Dr. Zhibao Wang from Northeast Petroleum University, Daqing 163318, China
-- Mr. Blink Yu Managing Editor E-Mail: blink.yu(a)mdpi.com Skype: live:c91693ac8277e1f0
-- MDPI Wuhan Office No.6 Jingan Road, 430064 Wuhan, China http://www.mdpi.com
-- Disclaimer: MDPI recognizes the importance of data privacy and protection. We treat personal data in line with the General Data Protection Regulation (GDPR) and with what the community expects of us. The information contained in this message is confidential and intended solely for the use of the individual or entity to whom it is addressed. If you have received this message in error, please notify me and delete this message from your system. You may not copy this message in its entirety or in part, or disclose its contents to anyone.

[Jobs] 3 Full professorships at the Tübingen AI Center
by lynn.anthonissen＠uni-tuebingen.de 22 May '23

The Tübingen AI Center is inviting applications for the following tenured positions: - Full Professor of Machine Learning and Intelligent Systems (2 positions) - Full Professor of ML Engineering and Technology Transfer The Tübingen AI Center is a research institution hosted by the University of Tübingen in cooperation with the Max Planck Institute for Intelligent Systems whose core machine learning faculties work together to develop more robust, efficient and accountable learning systems. Embedded in Tübingen's rapidly growing science and technology campus, the Tübingen AI Center has close ties with the newly established ELLIS Institute Tübingen, and more generally cooperates closely with the pan-European ELLIS network as well as the Cyber Valley initiative, which connects researchers with start-ups and industry in the area. Details about the positions and how to apply can be found at https://tuebingen.ai/careers. Applications should be submitted by June 14, 2023. Inquiries about the position may be directed to the Central Office of the Tübingen AI Center (lynn.anthonissen(a)uni-tuebingen.de).

CoCo4MT 2023 @ MT SUMMIT First Call for Papers and Shared Task
by John Ortega 21 May '23

The Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) @MT-SUMMIT XIX
The 19th Machine Translation Summit, Sep 4-8, 2023, Macau SAR, China
https://sites.google.com/view/coco4mt
SCOPE
It is a well-known fact that machine translation systems, especially those that use deep learning, require massive amounts of data. Several resources for languages are not available in their human-created format. Some of the types of resources available are monolingual, multilingual, translation memories, and lexicons. Parallel resources are generally created for formal purposes, such as parliamentary collections, while monolingual resources tend to come from more informal situations. The quality and abundance of resources, including corpora, used for formal reasons is generally higher than those used for informal purposes. Additionally, corpora for low-resource languages, that is, languages with fewer digital resources available, tend to be less abundant and of lower quality. CoCo4MT is a workshop centered around research that focuses on manual and automatic corpus creation, cleansing, and augmentation techniques specifically for machine translation. We accept work that covers any language (including sign language) but we are specifically interested in those submissions that explicitly report on work with languages with limited existing resources (low-resource languages). Since techniques from high-resource languages are generally statistical in nature and could be used as generic solutions for any language, we welcome submissions on high-resource languages also. CoCo4MT aims to encourage research on new and undiscovered techniques. We hope that the methods presented at this workshop will lead to the development of high-quality corpora that will in turn lead to high-performing MT systems and new dataset creation for multiple corpora.
We hope that submissions will provide high-quality corpora that are available publicly for download and can be used to increase machine translation performance thus encouraging new dataset creation for multiple languages that will, in turn, provide a general workshop to consult for corpora needs in the future. The workshop’s success will be measured by the following key performance indicators: - Promotes the ongoing increase in quality of machine translation systems when measured by standard measurements, - Provides a meeting place for collaboration from several research areas to increase the availability of commonly used corpora and new corpora, - Drives innovation to address the need for higher quality and abundance of low-resource language data. Topics of interest include: - Difficulties with using existing corpora (e.g., political considerations or domain limitations) and their effects on final MT systems, - Strategies for collecting new MT datasets (e.g., via crowdsourcing), - Data augmentation techniques, - Data cleansing and denoising techniques, - Quality control strategies for MT data, - Exploration of datasets for pretraining or auxiliary tasks for training MT systems. SHARED TASK To encourage research on corpus construction for low-resource machine translation, we introduce a shared task focused on identifying high-quality instances that should be translated into a target low-resource language. Participants are provided access to multi-way corpora in the high-resource languages of English, Spanish, German, Korean, and Indonesian, and using these, are required to identify beneficial instances, that when translated into the low-resource languages of Cebuano, Gujarati, and Burmese, lead to high-performing MT systems. More details on data, evaluation and submission can be found on the website (https://sites.google.com/view/coco4mt) or by emailing coco4mt-shared-task(a)googlegroups.com. 
SUBMISSION INFORMATION

CoCo4MT will accept research, review, or position papers. Each paper should be at least four (4) and at most ten (10) pages long, plus unlimited pages for references. Submissions should be formatted according to the official MT Summit 2023 style templates (https://www.overleaf.com/latex/templates/mt-summit-2023-template/knrrcnxhkq…). Accepted papers will be published in the MT Summit 2023 proceedings, which are included in the ACL Anthology, and will be presented at the conference either orally or as a poster.

Submissions must be anonymized and should be made to the workshop using the Softconf conference management system (https://softconf.com/mtsummit2023/CoCo4MT). Scientific papers that have been or will be submitted to other venues must be declared as such, and must be withdrawn from the other venues if accepted and published at CoCo4MT. The review will be double-blind.

We would like to encourage authors to cite papers written in ANY language that are related to the topics, as long as both original bibliographic items and their corresponding English translations are provided.

Registration will be handled by the main conference.
(To be announced)

IMPORTANT DATES

May 18, 2023 - Call for papers released
May 19, 2023 - Shared task release of train, dev and test data
May 25, 2023 - Shared task release of baselines
June 5, 2023 - Second call for papers
June 20, 2023 - Third and final call for papers
July 05, 2023 - Paper submissions due
July 05, 2023 - Shared task deadline to submit results
July 20, 2023 - Notification of acceptance
July 20, 2023 - Shared task system description papers due
July 31, 2023 - Camera-ready due
September 4-5, 2023 - CoCo4MT workshop

CONTACT

CoCo4MT Workshop Organizers: coco4mt-2023-organizers(a)googlegroups.com
CoCo4MT Shared Task Organizers: coco4mt-shared-task(a)googlegroups.com

ORGANIZING COMMITTEE (listed alphabetically)

Ananya Ganesh, University of Colorado Boulder
Constantine Lignos, Brandeis University
John E. Ortega, Northeastern University
Jonne Sälevä, Brandeis University
Katharina Kann, University of Colorado Boulder
Marine Carpuat, University of Maryland
Rodolfo Zevallos, Universitat Pompeu Fabra
Shabnam Tafreshi, University of Maryland
William Chen, Carnegie Mellon University

PROGRAM COMMITTEE (listed alphabetically, tentative)

Abteen Ebrahimi, University of Colorado Boulder
Adelani David, Saarland University
Ananya Ganesh, University of Colorado Boulder
Alberto Poncelas, ADAPT Centre at Dublin City University
Anna Currey, Amazon
Amirhossein Tebbifakhr, University of Trento
Atul Kr. Ojha, National University of Ireland Galway
Ayush Singh, Northeastern University
Barrow Haddow, University of Edinburgh
Bharathi Raja Chakravarthi, National University of Ireland Galway
Beatrice Savoldi, University of Trento
Bogdan Babych, Heidelberg University
Briakou Eleftheria, University of Maryland
Constantine Lignos, Brandeis University
Dossou Bonaventure, Mila Quebec AI Institute
Duygu Ataman, New York University
Eleftheria Briakou, University of Maryland
Eleni Metheniti, Université Toulouse - Paul Sabatier
Jasper Kyle Catapang, University of Birmingham
John E. Ortega, Northeastern University
Jonne Sälevä, Brandeis University
Kalika Bali, Microsoft
Katharina Kann, University of Colorado Boulder
Kochiro Watanabe, The University of Tokyo
Koel Dutta Chowdhury, Saarland University
Liangyou Li, Huawei
Manuel Mager, University of Stuttgart
Maria Art Antonette Clariño, University of the Philippines Los Baños
Marine Carpuat, University of Maryland
Mathias Müller, University of Zurich
Nathaniel Oco, De La Salle University
Niu Xing, Amazon
Patrick Simianer, Lilt
Rico Sennrich, University of Zurich
Rodolfo Zevallos, Universitat Pompeu Fabra
Sangjee Dondrub, Qinghai Normal University
Santanu Pal, Saarland University
Sardana Ivanova, University of Helsinki
Shantipriya Parida, Silo AI
Shiran Dudy, Northeastern University
Surafel Melaku Lakew, Amazon
Tommi A Pirinen, University of Tromsø
Valentin Malykh, Moscow Institute of Physics and Technology
Xing Niu, Amazon
Xu Weijia, University of Maryland


Call for participation - Arabic NER Shared Task 2023
by nagham ghanim 20 May '23

20 May '23

Dear colleagues,

We are happy to invite you to join the *Arabic NER Shared Task 2023* <https://dlnlp.ai/st/wojood/>, which will be organized as part of WANLP 2023. We will provide you with a large corpus and Google Colab notebooks to help you reproduce the baseline results.

Invitation to participate in the shared task on extracting named entities from Arabic text. We will provide participants with a corpus and software for obtaining baseline results that they can build on.

*INTRODUCTION*
Named Entity Recognition (NER) is integral to many NLP applications. It is the task of identifying named entity mentions in unstructured text and classifying them into predefined classes such as person, organization, location, or date. Due to the scarcity of Arabic resources, most of the research on Arabic NER focuses on flat entities and addresses a limited number of entity types (person, organization, and location). The goal of this shared task is to alleviate this bottleneck by providing Wojood, a large and rich Arabic NER corpus. Wojood consists of about 550K tokens (MSA and dialect, in multiple domains) that are manually annotated with 21 entity types.

*REGISTRATION*
Participants need to register via this form (*https://forms.gle/UCCrVNZ2LaPviCZS6*). Participating teams will be provided with common training and development datasets. No external manually labelled datasets are allowed. A blind test set will be used to evaluate the output of the participating teams. Each team is allowed a maximum of 3 submissions. All teams are required to report on the development and test sets (after results are announced) in their write-ups.
*FAQ*
For any questions related to this task, please check our *Frequently Asked Questions* <https://docs.google.com/document/d/1XE2n89mFLic2P9DO_sAD51vy734BOt0kgtZ6bFf…>

*IMPORTANT DATES*
- March 03, 2023: Registration available
- May 25, 2023: Data-sharing and evaluation on development set available
- June 10, 2023: Registration deadline
- July 20, 2023: Test set made available
- July 30, 2023: Evaluation on test set (TEST) deadline
- September 5, 2023: Shared task system paper submissions due
- October 12, 2023: Notification of acceptance
- October230, 2023: Camera-ready papers due
*All deadlines are 11:59 PM UTC-12:00 (Anywhere On Earth).*

*CONTACT*
For any questions related to this task, please contact the organizers directly at *NERSharedtask2023(a)gmail.com* or join the Google group: *https://groups.google.com/g/ner_sharedtask2023*.

*SHARED TASK*
As described, this shared task targets both flat and nested Arabic NER. The subtasks are:

*Subtask 1:* *Flat NER*
In this subtask, we provide the Wojood-Flat train (70%) and development (10%) datasets. The final evaluation will be on the test set (20%). The flat NER dataset uses the same train/dev/test split as the nested NER dataset, and each split contains the same content. The only difference is that in the flat NER dataset each token is assigned a single tag, which is the first high-level tag assigned to that token in the nested NER dataset.

*Subtask 2:* *Nested NER*
In this subtask, we provide the Wojood-Nested train (70%) and development (10%) datasets. The final evaluation will be on the test set (20%).

*METRICS*
The evaluation metrics will include precision, recall, and F1-score. However, our official metric will be the micro F1-score. The evaluation of the shared tasks will be hosted on CODALAB. Teams will be provided with a CODALAB link for each shared task.
- *CODALAB link for NER Shared Task Subtask 1 (Flat NER)* <https://codalab.lisn.upsaclay.fr/competitions/11594>
- *CODALAB link for NER Shared Task Subtask 2 (Nested NER)* <https://dlnlp.ai/st/wojood/>

*BASELINES*
Two baseline models trained on Wojood (flat and nested) are provided:

*Nested NER baseline:* presented in this *article* <https://aclanthology.org/2022.lrec-1.387/>, with code available on *GitHub* <https://github.com/SinaLab/ArabicNER>. The model achieves a micro F1-score of 0.9059 (note that this baseline does not handle nested entities of the same type).

*Flat NER baseline:* the same code repository as for nested NER (*GitHub* <https://github.com/SinaLab/ArabicNER>) can also be used to train a model for the flat NER task. Our flat NER baseline achieved a micro F1-score of 0.8785.

*GOOGLE COLAB NOTEBOOKS*
To allow you to experiment with the baselines, we authored four Google Colab notebooks that demonstrate how to train and evaluate our baseline models.

[1] *Train Flat NER* <https://gist.github.com/mohammedkhalilia/72c3261734d7715094089bdf4de74b4a>: This notebook can be used to train our ArabicNER model on the flat NER task using the sample Wojood data found in our repository.
[2] *Evaluate Flat NER* <https://gist.github.com/mohammedkhalilia/c807eb1ccb15416b187c32a362001665>: This notebook uses the trained model saved by the notebook above to perform evaluation on an unseen dataset.
[3] *Train Nested NER* <https://gist.github.com/mohammedkhalilia/a4d83d4e43682d1efcdf299d41beb3da>: This notebook can be used to train our ArabicNER model on the nested NER task using the sample Wojood data found in our repository.
[4] *Evaluate Nested NER* <https://gist.github.com/mohammedkhalilia/9134510aa2684464f57de7934c97138b>: This notebook uses the trained model saved by the notebook above to perform evaluation on an unseen dataset.
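For readers unfamiliar with the micro F1-score used as the official metric and reported for both baselines, the sketch below shows one common way to compute micro-averaged precision, recall, and F1 over entity spans: true positives, false positives, and false negatives are pooled across all sentences and entity types before the ratios are taken. It is a minimal illustration, not the official CODALAB scorer; the function name `micro_prf`, the `(start, end, label)` span representation, and the exact-match criterion are assumptions, and the labels in the example are illustrative only.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1 over entity spans.

    `gold` and `pred` are lists with one set per sentence; each set holds
    (start, end, label) tuples. A predicted span counts as correct only if
    boundaries and label both match a gold span exactly (an assumption of
    this sketch; the official scorer may differ).
    """
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        tp += len(gold_spans & pred_spans)   # exact matches
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative spans only; labels are placeholders for the example.
    gold = [{(0, 2, "PERS"), (3, 4, "GPE")}, {(1, 2, "ORG")}]
    pred = [{(0, 2, "PERS"), (3, 4, "LOC")}, {(1, 2, "ORG"), (4, 5, "DATE")}]
    p, r, f = micro_prf(gold, pred)
    print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Because spans are stored as sets of tuples, the same function accepts nested annotations, where spans may overlap within a sentence, without modification.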
*ORGANIZERS*
- Mustafa Jarrar, Birzeit University
- Muhammad Abdul-Mageed, University of British Columbia & MBZUAI
- Mohammed Khalilia, Birzeit University
- Bashar Talafha, University of British Columbia
- AbdelRahim Elmadany, University of British Columbia
- Nagham Hamad, Birzeit University
- Alaa Omer, Birzeit University
