Dear colleagues,
I'm devoted to the revitalization and massification of the Andean Amazonian
native language with computational processing as a key enabler.
Among the many tasks involved, I am currently working on the creation of
neologisms. That is why I'm looking for the largest available multilingual
dictionary of phonetic spellings, ideally one that includes Asian languages
(Mandarin, Japanese, Korean, Hindi, Urdu, etc.).
If you have this kind of database, I kindly ask you to give me access; if
you don't, I'd appreciate any pointers on where and/or how to access one.
Kind regards
Luis Camacho <https://orcid.org/0000-0001-6569-550X>
------------------------------
Attention,
the Special Interest Group on Computational Morphology and Phonology
(SIGMORPHON) is currently soliciting proposals for Shared Tasks for the
2023 workshop.
If you're interested in organizing a shared task, or know someone who might
be interested, please visit
https://sigmorphon.github.io/workshops/2023/call_for_tasks/ for details.
Important dates:
Submission of proposal: September 30, 2022
Notification of acceptance: October 15, 2022
Data ready / task begins: January 31, 2023
Workshop: TBA, Summer 2023
Any inquiries can be sent to the official SIGMORPHON e-mail address:
sigmorphon(a)gmail.com
Garrett Nicolai
SIGMORPHON President
In this newsletter:
Upcoming Policy Change to LDC's Open Memberships
LDC at Interspeech 2022
LanguageARC: Citizen Science for Language
30th Anniversary Highlight: Switchboard
New publications:
Xi'an Guanzhong Object Naming<https://catalog.ldc.upenn.edu/LDC2022S09>
MASRI Synthetic<https://catalog.ldc.upenn.edu/LDC2022S08>
________________________________
Upcoming Policy Change to LDC's Open Memberships
LDC is changing its open membership year policy beginning January 1, 2023. Only one membership year will be open for joining - the current membership year. The 2022 membership year will close for joining on December 31, 2022. We expect this change to have a minimal impact on members, while allowing us to streamline our processes to serve members better. LDC's many membership benefits<https://www.ldc.upenn.edu/members/benefits> will remain the same and organizations choosing to join membership years in advance will still be able to do so. If you have any questions about this change, please don't hesitate to contact our membership office<mailto:ldc@ldc.upenn.edu>.
LDC at Interspeech 2022
LDC is proud to sponsor the Workshop for Young Female Researchers in Speech<https://sites.google.com/view/yfrsw-2022/> (YFRSW) to be held in person as an Interspeech 2022<https://interspeech2022.org/> pre-conference satellite event on September 17. Also, be sure to check out the collaborative work of LDC's Mark Liberman, "The mapping between syntactic and prosodic phrasing in English and Mandarin", presented during the On-Site Oral Session: Phonetics and Phonology on Wednesday, September 21, 13:30-15:30 KST.
LanguageARC: Citizen Science for Language
LanguageARC<https://languagearc.com> is a citizen science web portal for language research developed by LDC with the support of the National Science Foundation (grant #1730377).
LanguageARC brings together researchers and participants from the general public interested in language to form a community dedicated to supporting and advancing language-related research and development. Contributors to this online community can participate in a variety of language-related tasks and activities such as reading text, answering questions, describing images or video, creating or evaluating transcriptions for audio clips, or developing translations into their native languages. LanguageARC includes projects in languages other than English, such as French, Sesotho, and Swedish. Xi'an Guanzhong Object Naming LDC2022S09<https://catalog.ldc.upenn.edu/LDC2022S09>, released this month in LDC's Catalog and described below, is an example of a data set developed using LanguageARC. New projects will be added on an ongoing basis.
Sign up for a LanguageARC account today to start making real contributions to language knowledge and research. Please share this information with colleagues, students, and anyone who might be interested in participating in the language activities on this website. If you are a researcher interested in creating a project on LanguageARC, please reach out on the site's Contact<https://languagearc.com/messages/new> page.
Find LanguageARC on Facebook at: https://www.facebook.com/languagearc
30th Anniversary Highlight: Switchboard
Switchboard-1 Release 2 (LDC97S62<https://catalog.ldc.upenn.edu/LDC97S62>) is considered the first large collection of spontaneous conversational telephone speech (Graff & Bird, 2000<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2000-many-uses-…>). It consists of approximately 260 hours of recordings collected by Texas Instruments in 1990-1991 (Godfrey et al., 1992<https://isip.piconepress.com/projects/switchboard/doc/education/papers/pape…>). The first release of the corpus (later superseded) was published by NIST and distributed by LDC in 1993.
Participants were 543 speakers (302 male, 241 female) from across the United States who accounted for around 2,400 two-sided telephone conversations. A robot operator handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion, and recording the speech from the two subjects into separate channels until the conversation was finished. Roughly 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
This gold standard data set has been used for many HLT applications, including speaker identification, speaker authentication, and speech recognition. It is considered one of the most important benchmarks for recognition tasks involving large vocabulary conversational speech (Deshmukh et al., 1998<https://www.researchgate.net/profile/Aravind-Ganapathiraju/publication/2214…>) as well as a key resource for studying the phonetic properties of spontaneous speech (Greenberg et al., 1996<https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.3655&rep=rep1&…>). Annotation tasks based on Switchboard include discourse tags/speech acts, part-of-speech tagging and parsing, and sentiment analysis<https://catalog.ldc.upenn.edu/LDC2020T14>.
The Switchboard series includes Switchboard Credit Card<https://catalog.ldc.upenn.edu/LDC93S8>, Phase II<https://catalog.ldc.upenn.edu/LDC98S75>, Phase III<https://catalog.ldc.upenn.edu/LDC2002S06>, the Switchboard Cellular<https://catalog.ldc.upenn.edu/LDC2001S13> collection, and new recordings from 18 Switchboard participants in the 2013 Greybeard<https://catalog.ldc.upenn.edu/LDC2013S05> corpus.
All Switchboard corpora are available in the Catalog for licensing by Consortium members and non-members. Visit Obtaining Data<https://www.ldc.upenn.edu/language-resources/data/obtaining> for more information.
________________________________
New publications:
Xi'an Guanzhong Object Naming<https://catalog.ldc.upenn.edu/LDC2022S09> comprises 15 hours of audio recordings from speakers of the Guanzhong dialect of Mandarin Chinese living in or near Xi'an in Shaanxi Province (China) naming objects that appeared in colored line drawings. The corpus was developed to support traditional and computer-aided language documentation.
The collection was conducted from February-May 2021 using LanguageARC<https://languagearc.com/>, a citizen science portal developed by LDC, from a closed volunteer community. Speakers were presented with images selected from the MultiPic dataset<https://www.bcbl.eu/databases/multipic> and were asked to record themselves naming the objects in the images.
Xi'an Guanzhong Object Naming is distributed via web download.
2022 Subscription Members will automatically receive copies of this corpus. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
*
MASRI Synthetic<https://catalog.ldc.upenn.edu/LDC2022S08> MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team <https://www.um.edu.mt/projects/masri/> at the University of Malta<https://www.um.edu.mt/> and contains 99 hours of synthesized Maltese speech.
Source sentences were extracted from the Maltese Language Resource Server<https://mlrs.research.um.edu.mt/index.php?page=corpora> (MLRS) corpus, comprised of written or transcribed Maltese covering various genres, including parliamentary debates, news, law, opinion, sports, culture, academic, literature, and religious texts. Text was processed through the CrimsonWing text-to-speech system to generate speech files. Synthesized speech was created with 210 voices (105 female, 105 male).
MASRI Synthetic is distributed via web download.
2022 Subscription Members will automatically receive copies of this corpus provided they have submitted a completed copy of the special license agreement. 2022 Standard Members may request a copy as part of their 16 free membership corpora. Non-members may license this data for a fee.
Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: ldc(a)ldc.upenn.edu<mailto:ldc@ldc.upenn.edu>
M: 3600 Market St. Suite 810
Philadelphia, PA 19104
Second Call for Papers Global WordNet Conference 2023
[Apologies for cross posting]
Call for Papers
12th International Global Wordnet Conference
Donostia / San Sebastian, Basque Country
January 23-27, 2023
Global Wordnet Association: www.globalwordnet.org
Conference website: https://hitz.eus/gwc2023
The Global Wordnet Association is pleased to announce the 12th
International Global Wordnet Conference (GWC2023) in Donostia / San
Sebastian (Spain) hosted by HiTZ, Basque Center for Language Technology at
the University of the Basque Country.
NOTE: COVID-19 allowing, the conference will be in person only.
Organisers: Begoña Altuna, Itziar Aldabe, Xabier Arregi, Itziar
Gonzalez-Dios, Aritz Farwell and Esther Miranda.
Details about the Association and the full announcement for the conference
can be found on the conference website: https://hitz.eus/gwc2023
We invite submissions with original contributions addressing, but not
limited to, the topics listed below. Proposals for tutorials are welcome as
well.
Conference Topics
1. Lexical semantics and meaning representation
* Critical analysis and applications of lexical and semantic relations
* Proposed new relations
* Definitions, semantic components, co-occurrence and frequency statistics
* Word, Sense and Context Embeddings
* Necessity and completeness issues
* Ontology and wordnet
* Other lexicographical and lexicological questions pertaining to
wordnet-style meaning representation
* Wordnets and other modalities
2. Architecture of lexical databases
* Language independent and language dependent components
* Integration of multi-wordnets in research infrastructures (like CLARIN,
ELG, etc.)
* Wordnets and Linked Open Data (LOD)
3. Tools and methods for wordnet development
* User and Data entry interfaces
* Methods for constructing, extending and enriching wordnets
* Methods for linking wordnets to other lexical and semantic resources
* Methods for leveraging existing wordnets and semantic networks with large
language models
4. Applications of wordnet
* Word sense disambiguation
* Text generation
* Commonsense reasoning
* Machine translation
* Information extraction and retrieval
* Document structuring and categorisation
* Automatic hyperlinking
* Language pedagogy
* Psycholinguistic applications
* Embeddings and pretrained language models
* Probing large neural language models
5. Standardization, distribution and availability of wordnets and wordnet
tools
Submissions will fall into one of the following categories (page limits
exclude references):
* long papers: 8 pages max; 30-minute presentation
* short papers: 5 pages max; 15-minute presentation
* project reports: 5 pages max; 10-minute presentation
* demonstrations: 5 pages max, with an additional 3 pages of screen dumps or
images; 20-minute presentation
Submissions should be anonymous and any identifying information must be
removed. Authors must state the preferred category, though acceptance may
be subject to change in the category of the presentation, e.g. a long paper
submission may be accepted as a short paper.
Final papers should be submitted in electronic form (PDF only).
Paper submissions must use the official ACL style templates, which are
available from here <https://github.com/acl-org/acl-style-files> (LaTeX and
Word). Please follow the paper formatting guidelines general to “*ACL”
conferences available here
<https://acl-org.github.io/ACLPUB/formatting.html>. Authors may not modify
these style files or use templates designed for other conferences.
Submission site: https://easychair.org/conferences/?conf=gwc2023
Important Dates (NEW DATES!!!)
* September 30, 2022: Deadline for abstract submission
* October 14, 2022: Deadline for paper submission
* November 18, 2022: Notification of acceptance
* December 1, 2022: Registration opens
* December 23, 2022: Deadline for author registration and final version of paper
* January 23-27, 2023: Conference
Proceedings
Conference proceedings will be open access and downloadable from the GWA
website. The proceedings will have an ISBN and be published in the ACL
Anthology.
Inclusion of accepted submissions in the final program and the
proceedings is contingent upon at least one author's registration. Late
registration and on-site registration are possible for participants, but
without inclusion of a paper and without a presentation.
Conference Chairs
German Rigau - german.rigau(a)ehu.eus
Francis Bond - bond(a)ieee.org
Local Organizing Chairs
Begoña Altuna - begona.altuna(a)ehu.eus
Itziar Aldabe - itziar.aldabe(a)ehu.eus
Xabier Arregi - xabier.arregi(a)ehu.eus
Itziar Gonzalez-Dios - itziar.gonzalezd(a)ehu.eus
Aritz Farwell - asfarwell(a)ehu.eus
Esther Miranda - esther.miranda(a)ehu.eus
Program Committee (to be confirmed and extended)
Adam Pease, Articulate Software
Ales Horak, Masaryk University
Alexandre Rademaker, IBM Research Brazil and EMAp/FGV
Bolette Pedersen, University of Copenhagen
Christiane Fellbaum, Princeton University
Darja Fiser, University of Ljubljana
David Lindemann, IWiSt, University of Hildesheim
Diptesh Kanojia, IIT Bombay
Eneko Agirre, University of the Basque Country
Ewa Rudnicka, Wrocław University of Technology
Francis Bond, Palacký University
Gerard De Melo, Rutgers University
German Rigau, IXA Group, UPV/EHU
Haldur Oim, University of Tartu
Heili Orav, University of Tartu
Hugo Gonçalo-Oliveira, Department of Informatics Engineering of the
University of Coimbra
Janos Csirik, University of Szeged
John McCrae, National University of Ireland, Galway
Kadri Vider, University of Tartu
Kevin Scannell, Saint Louis University
Kyoko Kanzaki, Otemon Gakuin University
Maciej Piasecki, Department of Computational Intelligence, Wroclaw
University
Marten Postma, Vrije Universiteit Amsterdam
Paul Buitelaar, National University of Ireland, Galway
Piek Vossen, VU University Amsterdam
Sanni Nimb, The Danish Society for Language and Literature
Shan Wang, The Education University of Hong Kong
Shu-Kai Hsieh, National Taiwan Normal University
Sonja Bosch, Department of African Languages, University of South Africa
Thierry Declerck, DFKI, Saarbruecken
Tim Baldwin, The University of Melbourne
Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute
Umamaheswari Vasanthakumar, Nanyang Technological University
Valeria de Paiva, Natural Language and AI Research Laboratory of Nuance
Communications, Inc.
Verginica Mititelu, Romanian Academy Research Institute for Artificial
Intelligence
Sponsors
Keler https://www.keler.eus/en
Dear NLP Researchers,
For the 5th consecutive year, the Emerging Market Welfare Project and the
European Commission Joint Research Centre would like to invite you to test
your event extraction systems at the three shared tasks in the 5th Workshop
on Challenges and Applications of Automated Extraction of Socio-political
Events from Text (CASE 2022).
The shared tasks feature detection of politically-motivated
conflict events, especially detection of protests and riots.
You can use the occasion to test your system on a real-life text collection.
Please follow the link to the workshop and register if you want to
participate:
https://lnkd.in/dbGp2jRe
The organizers of CASE 2022
[Apologies for cross-posting]
*Literary Machine Translation as a Human-Machine Dialectic*
International conference
Thursday, 6 October 2022
University of Liège, Belgium
Program and registration: https://www.cirti.uliege.be/litMT2022
The Centre interdisciplinaire de recherches en traduction et en
interprétation (CIRTI, University of Liège) will hold a one-day
symposium centred around the topic of literary machine translation and
dialogue between human and machine as a potential computer-assisted
literary creation tool.
• Can we re-imagine and use machine translation tools for creative means?
• How do we train systems adapted to the literary domain?
• What impact would it have on creativity, quality and translators' voice?
• What are the ethical challenges brought about by new technologies?
Join us for a day of exchanges, presentations and round tables as we
tackle these questions.
All best,
The Organizing Team
---
*Damien Hansen*
Université de Liège - Université Grenoble Alpes
CIRTI - LIG/GETALP - LGL
The Speech Technology Group of Toshiba Europe LTD in Cambridge has an opening
for a researcher to work on multi-modal interfaces. The position offers the
opportunity to work with an interdisciplinary team focussing on both speech
and vision modalities. We are looking for candidates with a PhD or Masters
with deep learning experience who will contribute to advancing multi-modal
research and building prototype systems.
Please check here for more details:
https://careers.toshiba.eu/displayjob.aspx?jobid=351
-----------------------
Dr Svetlana Stoyanchev
Speech Technology Group,
Cambridge Research Lab,
Toshiba Europe Limited
https://www.linkedin.com/in/svetlana-stoyanchev/
Dear all,
We are really excited to be offering the highly successful free online course (MOOC) in corpus linguistics: 'Corpus Linguistics: Method, Analysis, Interpretation'. This anniversary tenth run of the course starts on 19 September 2022 and runs for eight weeks.
If you are interested, you can register now for free at https://www.futurelearn.com/courses/corpus-linguistics by clicking on 'Join today'.
As every year, we have added brand-new features to the course to keep you up to date with new developments in the field.
I hope to see you on the course!
Best,
Vaclav
**Special offer - £500 off Lancaster tuition fees**
If you decide to study the free MOOC, or if you actively participated and completed the course within the last three years, you are eligible to apply for our MOOC entry route to Lancaster University's MA (2 years, online) or Postgraduate Certificate (1 year, online). You can still apply for the 2022 start. By doing so, you will not take the first core module of the programme, Fundamentals of Corpus Linguistics, but will be expected to submit two written assessments, in January of year 1, using the knowledge you developed through the MOOC. This entry route has a fee discount of £500, which will be deducted from the first year of your fees.
MA https://www.lancaster.ac.uk/study/postgraduate/postgraduate-courses/corpus-…
PG Certificate https://www.lancaster.ac.uk/study/postgraduate/postgraduate-courses/corpus-…
Professor Vaclav Brezina
Professor in Corpus Linguistics
Department of Linguistics and English Language
ESRC Centre for Corpus Approaches to Social Science
Faculty of Arts and Social Sciences, Lancaster University
Lancaster, LA1 4YD
Office: County South, room C05
T: +44 (0)1524 510828
@vaclavbrezina
<http://www.lancaster.ac.uk/arts-and-social-sciences/about-us/people/vaclav-…>
***Translations & Open Science calls for tenders***
The OPERAS Research Infrastructure launches a series of calls for
tenders in order to lay the foundation of a technology-based scientific
translation service to foster multilingualism in scholarly communication
and thus help to remove language barriers according to Open Science
principles.
The first two calls are now open (submission deadline: 7 October 2022)
1. Mapping and collection of scientific bilingual corpora: identifying,
collecting and preparing corpora of bilingual scientific texts which
will serve as training datasets for specialised translation engines,
source data for terminology extraction, and translation memory creation
Link to call 1:
https://www.operas-eu.org/mapping-and-collection-of-scientific-bilingual-co…
2. Use case study for a technology-based scientific translation service:
drafting an overview of the current translation practices and challenges
in scholarly communication and defining the use cases of a
technology-based scientific translation service (expected users and
usage scenarios, features, quality requirements, editorial and technical
workflows)
Link to call 2:
https://www.operas-eu.org/use-case-study-for-a-technology-based-scientific-…
Please note that two additional calls will be released in the coming
months in the following areas: Machine translation output evaluation and
Roadmap and budget projections.
For any information about ongoing and future calls, please feel free to
contact Susanna Fiorini at susanna.fiorini(a)operas-eu.org
We would like to draw your attention to a currently open call
for a full-time academic position of Assistant Professor in the
field of Natural Language Processing at the Faculty of Informatics,
Masaryk University in Brno, Czech Republic.
https://www.muni.cz/en/about-us/careers/vacancies/70340
Assistant Professor Position in Natural Language Processing
Department
Department of Machine Learning and Data Processing – Faculty of Informatics
Deadline
30 Sep 2022
Start date
By mutual agreement.
The Dean of the Faculty of Informatics MU invites applications
for a position of Assistant Professor in Natural Language
Processing, with the Department of Machine Learning and Data
Processing.
This position aims to strengthen the work of the Natural
Language Processing Centre (NLP Centre - https://nlp.fi.muni.cz/)
at the Faculty of Informatics. The NLP Centre conducts basic and
applied research in all areas of text and speech analysis and
knowledge engineering with applications in data analysis projects
(often in cooperation with industrial partners) and education of
future language and data analysts. Besides research and
education, the abilities to work with a team of graduate students
on research targeting top NLP conferences and to engage
undergraduate and graduate students in both educational and
research exercises are crucial.
Job description key points
- Active international cooperation, in research and education.
- Involvement in teaching in the natural language processing area.
- Supervision of Master/Bachelor theses and consultancy or
co-supervision of PhDs.
- Involvement in expanding industrial cooperation in the
natural language processing area.
Requirements
- PhD in Informatics or related discipline.
- Passion for problem solving and desire for continuous
improvement in teaching skills.
- Existing track record in both education and research in
natural language processing.
- Expert knowledge in (several) areas covered by courses:
- PA153 Natural Language Processing
https://is.muni.cz/course/fi/PA153
- IA161 Natural Language Processing in Practice
https://is.muni.cz/course/fi/IA161
- PA164 Machine learning and natural language processing
https://is.muni.cz/course/fi/PA164
- PA154 Language Modeling
https://is.muni.cz/course/fi/PA154
- PV061 Machine Translation
https://is.muni.cz/course/fi/PV061
- PV277 Programming Applications for Social Robots
https://is.muni.cz/course/fi/PV277
- PA156 Dialogue Systems
https://is.muni.cz/course/fi/PA156
- IB047 Introduction to Corpus Linguistics and Computer Lexicography
https://is.muni.cz/course/fi/IB047
- Practical involvement in the development phase of software
project(s) with ability to demonstrate tools developed and
showcase data analyses performed.
- Dynamic, flexible personality, able to work well in teams.
- Languages – fluent English (both spoken and written), other
language(s) welcome.
- Experience from countries other than the Czech and Slovak
  Republics (at least half a year).
Desired skills and achievements
- Experience with research achievements published at the top
NLP conferences or best journals publishing NLP research
results.
- Ability to work well in interdisciplinary teams.
- Supervision of successfully defended Bachelor and/or Master theses.
- Experience with research project team leadership is an advantage.
- Open-source projects development and maintenance.
Other information
The starting salary for this Assistant Professor position is
76,500 CZK. As this tenure-track position progresses, the salary
can be modified within no more than 3 years, based on the
level of involvement in research and educational projects.
Applicants should submit
- CV;
- degree documents;
- summary of work experience, publication and teaching
activities and involvement in research grants;
- cover letter explaining your interest in the position and
the IT area;
- title and abstract of a lecture that can be presented as
a part of the application process;
- names and contacts of three professional and language
referees.
Please submit your application, including all required documents,
preferably online via the MU e-application at Vacancies | Masaryk
University (muni.cz)
If online submission is not possible, we also accept a paper
application accompanied by a declaration of the reason for
submitting it this way.
Any queries regarding the submission shall be sent to:
pers(a)fi.muni.cz.
Queries regarding the position as such can be addressed to
assoc. prof. Ales Horak: hales(a)fi.muni.cz.
Further information is available about:
Masaryk University at https://www.muni.cz/en
Brno at https://www.gotobrno.cz/en/
--
Ales Horak
Faculty of Informatics
Masaryk University
Brno, Czech Republic