Dear Colleagues,

As language models take center stage not only in NLP but in a vast array of scientific applications, the question of how best to map natural language in textual form into vector space is attracting more and more interest. While most popular models still use subword tokens as their atomic units, "token-free" methods, including character-level, byte-level, and visual text-rendering encodings, have been making promising progress. Still, the development and analysis of tokenization and detokenization methods are advancing more slowly than research in model architectures and optimization techniques, mostly because representation is applied at such an early stage of the modeling pipeline, which makes evaluation of new algorithms and techniques particularly challenging. Fundamental insights into the effect of representation atomicity on morphological modeling, on multilingual and crosslingual applications, on computational efficiency, on the representation of societal groups, and on other aspects are still being gained, making this research topic ripe for the aggregation and integration of findings and methodologies.

Our special issue, entitled "Atoms of Representation in Natural Language Processing", aims to collect such findings and insights, to encourage deep dives into the relationships between language and computation, and to foster holistic approaches and collaboration in the development and assessment of different aspects of representation in language models and other NLP systems and applications.

Suggested themes and article types for submissions include:
  • Novel schemas for subword tokenization and for tokenizer application methodologies
  • Benchmarks and analyses of tokenizer effectiveness and quality, including crosslingual and multilingual setups, morphological aspects, information-theoretic constructions, correlation with quality of learned embeddings and downstream model performance, ability to handle linguistic phenomena, security implications, societal implications, etc.
  • Development, modification, evaluation, and analysis of token-free representation schemata based on textual input
  • Development, modification, evaluation, and analysis of token-free representation schemata utilizing multimodal input such as visual, spatial, or acoustic signals; combination of different linguistic signals (auditory, textual, sign language) into a single input framework
  • Theoretical contributions addressing the expressive power or limitations of various textual representation methodologies
  • Analysis of the textual modality and its representation at the computational level, e.g., of Unicode standards

The full call is available here: https://www.mdpi.com/journal/applsci/special_issues/2SHP0751R0.
Several waivers and discounts are available (please contact me directly).

--

Yuval Pinter
Guest Editor