CFP: The 1st Workshop on NLP for Languages Using Arabic Script (AbjadNLP 2025) - Corpora

10 Aug 2024


      The 1st Workshop on NLP for Languages Using Arabic Script
(AbjadNLP 2025)
Abu Dhabi, UAE
19-20 January 2025
Submission URL: https://softconf.com/coling2025/AbjadNLP25/
Co-located with COLING 2025 Conference, Abu Dhabi, UAE (19-20 January 
2025)
AbjadNLP is dedicated to advancing innovation and gaining deeper 
insights into Natural Language Processing (NLP) for languages that use 
the Arabic script. Our primary focus is on Abjad and Ajami languages 
that utilise the Arabic script or its variations. Traditionally 
associated with Semitic languages, Abjad scripts primarily represent 
consonants, leaving most vowels unmarked. In contrast, Ajami scripts denote the alphabetic use of 
the Arabic script in various African contexts, representing non-Arabic 
languages. We are interested in research on languages that fall under 
the Abjad or Ajami categories that use the Arabic script or any 
variations of it.
We invite contributions, discussions, and explorations that delve deep 
into the unique linguistic structures, resources, challenges, and 
untapped potential presented by Abjad and Ajami languages within the 
realm of NLP and language resources. Our goal is to create synergies 
among researchers by addressing the diverse phenomena and challenges 
inherent in these rich linguistic traditions.
The workshop is proud to highlight our connections with the Masakhane 
NLP community and collaborations with institutions worldwide, such as 
COMSATS on Urdu, and the long-standing UCREL NLP Group at Lancaster 
University, whose work encompasses over 20 languages worldwide, 
including Abjad and Ajami languages.
Note: We chose the name Abjad for simplicity, but our focus includes 
Abjad and other languages that have adopted the Arabic and Perso-Arabic 
scripts, as well as Ajami languages. We acknowledge that Sorani Kurdish, 
when written in Arabic script, follows an alphabet style rather than an 
Abjad style.
Workshop Description:
We welcome contributions, discussions, and explorations that thoroughly 
investigate the distinctive linguistic structures, resources, 
challenges, and untapped potential of Abjad and Ajami languages within 
the field of NLP and language resources. Our aim is to foster 
collaboration among researchers by addressing the varied phenomena and 
challenges inherent in these rich linguistic traditions.
Ajami languages, the many African languages that have adopted the 
Arabic script, number at least 43, 
including Hausa, Fulfulde, Mandingo, Swahili, Wolof, Kanuri, and 
Tamazight. The combined number of speakers of these languages is 
estimated to exceed 200 million within Africa alone. Although Abjad has 
been traditionally associated with Semitic languages such as Arabic, 
Hebrew and Syriac, it has been adopted for writing by many other 
language communities as in Perso-Arabic scripts used in Persian, Urdu, 
Pashto, Sorani Kurdish, Azeri Turkish, Sindhi, and Uyghur, with a 
collective estimated speaker population exceeding 500 million. 
Altogether, these languages represent an approximate global aggregate of 
1 billion speakers.
The adoption of the Arabic script across diverse linguistic landscapes 
highlights its expansive and varied application, transcending genres 
such as governmental correspondences, poetic compositions, religious 
texts, and journalistic pursuits. This widespread use underscores the 
imperative need to enhance digital infrastructure, tools, and resources 
for these under-resourced languages. Advancing such resources is crucial 
to nurturing linguistic diversity and resilience in both digital and 
print media, ensuring the preservation of linguistic heritage in the 
digital age.
Currently, there is an increasing interest in various NLP communities, 
both in academia and industry, in writing systems. However, there is a 
lack of initiatives focusing on the diverse phenomena and challenges of 
the languages using an Abjad script. The AbjadNLP workshop aims to fill 
this gap, fostering collaboration and innovation in this vital area of 
study.
Motivation
Languages employing an Abjad script signify a pivotal and diverse 
fragment of the global linguistic mosaic, traversing numerous countries 
and regions and embodying a considerable populace of speakers. The 
linguistic wealth and geographical diffusion of languages covered by 
AbjadNLP present a prolific environment for exploration and advancement 
in NLP. By channeling attention towards these languages, the realm of 
NLP is poised to unlock access to an expansive and varied array of 
linguistic constructions, subtleties, and cultural contexts, pivotal for 
bolstering the versatility and adaptability of NLP models and 
applications. The extensive spectrum of these languages not only unfolds 
a valuable opportunity to amplify multilingualism and multiculturalism 
in NLP research but also forges pathways for addressing the requisites 
and challenges intrinsic to a diverse and extensive speaker population.
The broad adoption of Abjad scripts transcends diverse genres, including 
governmental correspondences, poetic compositions, religious texts, and 
journalistic pursuits. The sustained use of such scripts underscores the 
imperative need to enhance digital infrastructure, tools, and resources 
that elucidate the varied writing systems inherent to under-resourced 
languages. Such advancement is crucial to nurturing linguistic diversity 
and resilience in both digital and print media, ensuring that the 
linguistic heritage does not diminish in the digital age.
This workshop can contribute to more inclusive and equitable 
progressions in NLP, accommodating a broader assortment of languages and 
dialects and promoting enhanced comprehension and interconnectivity 
amongst varied linguistic communities. The assimilation and 
prioritization of these linguistically affluent and diverse languages 
are indispensable for the comprehensive progression and the universal 
adaptability of NLP technologies. While our workshop primarily targets 
languages using an Abjad script, we recognize that many historical 
languages such as Aramaic, Sogdian, Parthian and Phoenician employed 
such a writing system. As such, we believe that our workshop can reinforce 
links with researchers working on endangered languages as well.
We are proud to highlight our existing connections with the Masakhane 
NLP community (www.masakhane.io) and collaborations with institutions 
worldwide, such as COMSATS on Urdu (www.comsats.edu.pk), and the 
long-standing UCREL NLP Group at Lancaster University, whose work 
encompasses over 20 languages worldwide, including Abjad and Ajami 
languages (http://ucrel-web-dev.lancs.ac.uk/ucrelng/).
Team
Our team is uniquely diverse and gender-balanced, comprising individuals 
from a wide range of ethnic backgrounds. We represent a spectrum of 
languages that use the Arabic script and include researchers from both 
Linguistics and NLP, enriching the ever-needed collaboration between 
these two fields. With expertise in language technology, Unicode, NLP, 
resources, and multilingual text analysis, together, we aim to foster a 
dynamic and inclusive environment for research and collaboration in the 
field of NLP.
Call for papers
We invite submissions on topics that include, but are not limited to, 
the following:
* Enabling core technologies: morphological analysis, disambiguation, 
tokenisation, POS tagging, named entity detection, chunking, parsing, 
semantic role labelling, sentiment analysis, language modelling, etc.
* Applications: machine translation, speech recognition, speech 
synthesis, optical character recognition, pedagogy, assistive 
technologies, social media, etc.
* Resources: dictionaries, annotated data, corpora, etc.
In addition, we extend a warm invitation to researchers and stakeholders 
across the spectrum to contribute papers focusing on, but not limited 
to, the following dimensions:
* Orthography descriptions (Constable 2002; Hosken 2003)
* Advancements in Font Technology, Glyph Rendering, and OCR
* Text Input Methodologies
* Development and Utilisation of Exploitable Dictionaries
* Enhancements in Spell-Checking Support
* Advancements in Speech-to-Text Solutions
* Progressive Natural Language Processing Techniques
* BLARK - Basic Language Resource Kit descriptions for languages using 
Abjad or Ajami
* Insights and Experiences Utilising Data Supplied by the Unicode-Hosted 
Common Locale Data Repository (CLDR) in Abjad or Ajami
* Morphological and syntactic challenges in Abjad or Ajami 
Orthographies
* Development of open access corpora in Abjad or Ajami
* Text processing and transliteration challenges and solutions for 
languages using Abjad or Ajami
* Cultural and sociolinguistic considerations in NLP applications for 
Abjad or Ajami
* Language resources and NLP tools for Abjad or Ajami
Summary of the Call:
We welcome submissions of papers centred around the Abjad and Ajami 
theme, focusing on supporting NLP language resources for non-Arabic 
languages utilising Arabic script. We encourage submissions that span a 
spectrum from theoretical investigations to practical applications, 
aiming to underscore the distinctive challenges, solutions, and insights 
that languages using Ajami and Abjad scripts introduce to the field of 
NLP.
For the submission format and guidelines, we follow the COLING 2025 
standards. Authors are encouraged to thoroughly review and adhere to the 
COLING 2025 submission guidelines and author kit, which can be found at: 
https://coling2025.org/calls/submission_guidlines/.
If authors are describing an orthography, we request that they include 
the points recommended in (Hosken 2003 
https://scripts.sil.org/WP-Encoding). For continuity across the workshop 
and greater impact across industry applications, authors should consider 
terminological (orthography, script, writing system, etc.) differences 
presented by Constable (2002) 
https://www.sil.org/resources/publications/entry/7853. The model 
presented by Constable is the current Unicode model.
Please ensure that all submissions strictly conform to these standards 
to streamline the review process and maintain uniformity across all 
contributions. Both long papers (up to 8 pages) and short papers (up to 
4 pages) are welcome. All submissions will undergo a rigorous 
peer-review process, emphasizing originality, relevance, and clarity.
Submissions may be of two types:
* Long papers - up to eight (8) pages maximum, presenting substantial, 
original, completed, and unpublished work.
* Short papers - up to four (4) pages, describing a small focused 
contribution, negative results, system demonstrations, etc.
Submission URL: https://softconf.com/coling2025/AbjadNLP25/
Submission Guidelines: 
https://coling2025.org/calls/submission_guidlines/
Provisional Key Dates:
* 1st Call for Papers Announcement: 16 July 2024
* 2nd Call for Papers Announcement: 16 August 2024
* Paper Submission Deadline: 15 November 2024
* Notification of Paper Acceptance: 6 December 2024
* Camera-ready Paper Deadline: 13 December 2024
* Workshop Date: either 19 or 20 January 2025
Anti-Harassment Policy:
The workshop supports the COLING anti-harassment policy 
https://coling2022.org/policy
Organising Committee:
General Chair:
* Dr. Mo El-Haj, Senior Lecturer at Lancaster University, is a Natural 
Language Processing expert with a focus on Arabic and under-resourced 
languages. He founded the FNP workshop series in 2018 and has organised 
workshops at top NLP conferences including LREC and COLING. 
http://elhaj.uk/ [1]
Programme Chairs:
* Mr Hugh Paterson III. Collaborative Scholar in linguistics, writing 
systems, metadata, and archives. http://hugh4.us [2]
* Dr Saad Ezzini. Lecturer at Lancaster University, UK. Saad has 
experience working on low-resource languages with a focus on machine 
translation, QA, IR, and language modelling. http://ezzini.github.io [3]
* Dr Ignatius Ezeani. Senior Research Associate working on 
multilingual NLP, Lancaster University, UK 
https://www.lancaster.ac.uk/scc/about-us/people/ignatius-ezeani [4]
Review Committee:
* Dr Manum Hayat Khan. Cognitive Linguistics Researcher at the 
University of La Rioja, Spain. 
https://investigacion.unirioja.es/investigadores/1183/detalle
* Dr Muhammad Sharjeel. Assistant Professor working on Urdu NLP, 
COMSATS University Islamabad, Pakistan 
https://scholar.google.com/citations?user=xUF3l9gAAAAJ&hl=en [5]
Publication Chair:
* Dr Sina Ahmadi. Postdoctoral researcher at University of Zurich 
focusing on leveraging language technology to assist languages with 
constrained digital resources, with an emphasis on adapting current 
natural language processing approaches and existing resources for 
less-resourced languages. https://sinaahmadi.github.io/
Publicity Chairs:
* Ms Cynthia Amol. NLP Researcher focusing on low-resource languages 
at Maseno University, Kenya. 
https://ke.linkedin.com/in/cynthia-amol-779668119
* Ms Amal Haddad Haddad. PhD Student in translations and terminologies 
at the University of Granada, Spain. http://lexicon.ugr.es/haddad
* Dr Jaleh Delfani. Research Fellow in Translation at University of 
Surrey https://www.surrey.ac.uk/people/jaleh-delfani [6]
Advisory Committee:
* Prof. Ruslan Mitkov, Professor of Computing and Communications at 
Lancaster University, actively working on different research topics from 
the areas of Natural Language Processing (NLP), Computational 
Linguistics and Translation Technology. https://wp.lancs.ac.uk/mitkov/ 
[7]
* Prof. Paul Rayson, Director of UCREL research centre at Lancaster 
University, specialises in semantic-based NLP across 20 languages, 
including Urdu and Arabic. With 25 years of experience, he excels in 
noisy language environments like financial disclosures and has organised 
various conferences and workshops. 
https://www.lancaster.ac.uk/staff/rayson/ [8]
Programme Committee*
* Abdoulaye Diallo. Fula & Wolof. Independent Researcher
* Ahmed Abdelali. Arabic/Multilingual NLP. Qatar Computing Research 
Institute (QCRI), Qatar.
* Ahmed AbuRa'ed. Arabic. UBC. Canada.
* Alp Oktem. Tigrinya and Kanuri. Translators without Borders
* Antonio Moreno Sandoval. Low Resourced Languages. UAM. Spain
* Azizud Din. Pashto. University Malaysia Sarawak. Malaysia
* Behnam Sabeti. Persian. Miras Technologies International. Iran
* Chenggang Mi. Uyghur. Xinjiang Technical Institute. China
* Clement Oyeleke. Yoruba. University of Ibadan. Nigeria
* Daniel Whitenack. Kimbundu, Fulfulde, Pular. SIL International. USA
* Derguene Mbaye. Wolof. Baamtu. Senegal
* Djamel Mostefa. Pashto. ELDA, France.
* Doaa Samy. Arabic. Cairo University, Egypt and LLI-UAM. Spain
* Elias W BA. Fula and Wolof. Baamtu. Senegal
* Eric Atwell. Arabic/Multilingual NLP. Leeds University, UK.
* Frederick Apina. Swahili. Parrot.AI. Tanzania
* George Giannakopoulos. Multilingual NLP. SKEL Lab - NCSR Demokritos. 
Greece
* Haithem Afli. Arabic/Multilingual NLP. Dublin City University, 
Ireland.
* Hazem Hajj. Arabic/Multilingual NLP. American University of Beirut, 
Lebanon.
* Houda Bouamor. Arabic/Multilingual NLP. CMU. Qatar
* Ignatius Ezeani. Igbo, African Languages NLP. Lancaster University, 
UK.
* Imed Zitouni. Arabic/Multilingual NLP. Microsoft Research, USA.
* Karim Bouzoubaa. Arabic/Multilingual NLP. Mohamed Vth University, 
Morocco.
* Mariam Masoud. Sorani Kurdish. National University of Ireland 
Galway. Ireland
* Lei Wang. Uyghur. Xinjiang Technical Institute. China
* Marina Litvak. Hebrew and Arabic. Sami Shamoon College of 
Engineering, Israel
* Mo El-Haj. Arabic/Multilingual and Low resourced Languages. 
Lancaster University, UK
* Muhammad Sharjeel. Urdu. COMSATS University Islamabad, Pakistan.
* Omid Momenzadeh. Persian. Miras Technologies International. Iran
* Paul Rayson. Multilingual and Low resourced Languages. Lancaster 
University, UK
* Preni Golazizian. Persian. Miras Technologies International. Iran
* Rao Muhammad Adeel Nawab. Urdu. COMSATS University Islamabad, 
Pakistan.
* Reza Fahmi. Persian. Miras Technologies International. Iran
* Samuel Olanrewaju. Yoruba, Yagba and Basa. University of Ibadan. 
Nigeria
* Scott Piao. Multilingual and Low resourced Languages. Lancaster 
University, UK
* Seyed Arad Ashrafi Asli. Persian. Miras Technologies International. 
Iran
* Shervin Malmasi. Sorani Kurdish. Macquarie University. Australia
* Sina Ahmadi. Sorani Kurdish. National University of Ireland Galway. 
Ireland
* Sokhar Samb. Wolof. ML & NLP. Senegal
* Tonghai Jiang. Uyghur. Xinjiang Technical Institute. China
* Waziri Shebogholo. Swahili. Parrot.AI. Tanzania
* Wole Akin. IsiXhosa, Yorùbá, Hausa, and Igbo. University of 
Johannesburg. South Africa
* Xi Zhou. Uyghur. Xinjiang Technical Institute. China
* Yating Yang. Uyghur. Xinjiang Technical Institute. China
* Zahra Majdabadi. Persian. Miras Technologies International. Iran
_*We are in the process of forming a linguistically diverse programme 
committee of experts in languages that use the Arabic script (Abjad and 
Ajami); most of those listed have already confirmed that they will serve 
as reviewers. As soon as we gain access to SoftConf, we will extend 
invitations to the remaining members. If your name appears in this list 
and you want it removed, please contact any of the organisers as soon 
as possible and we will remove it. Thanks._
Links:
------
[1] http://elhaj.uk/
[2] http://hugh4.us/
[3] http://ezzini.github.io/
[4] https://www.lancaster.ac.uk/scc/about-us/people/ignatius-ezeani
[5] https://scholar.google.com/citations?user=xUF3l9gAAAAJ&hl=en
[6] https://www.surrey.ac.uk/people/jaleh-delfani
[7] https://wp.lancs.ac.uk/mitkov/
[8] https://www.lancaster.ac.uk/staff/rayson/