[Corpora-List] 2nd CFP: Challenges in the Management of Large Corpora (CMLC-11)

24 Apr 2023


      11th Workshop on the Challenges in the Management of Large Corpora (CMLC)
The next meeting of CMLC will be held as part of Corpus Linguistics 2023
https://wp.lancs.ac.uk/cl2023/ in Lancaster, UK, on the 2nd of July,
2023.
See https://corpora.ids-mannheim.de/cmlc-2023.html for up-to-date 
information.
Important dates
  * Deadline for abstract submission: the 27th of April 2023 (Thursday,
    23:59 UTC)
  * Notification of acceptance: the 11th of May 2023 (Thursday)
  * Deadline for the submission of camera-ready papers: the 4th of June
    2023 (Sunday)
  * Meeting: Sunday, the 2nd of July 2023, 9.30-12.30 in George Fox LT2
    (Lancaster University Campus)
Abstract submission
  * We invite anonymised extended abstracts for /oral presentations/ on
    the topics listed below (/ideally/ using the ACL-2023 templates
    https://2023.aclweb.org/calls/style_and_formatting/, or PDF,
    750-1000 words excluding references, font preferably 11 pt, line
    spacing 1.5).
  * CMLC has always reserved a track for national corpus project
    reports, and to this end, we invite /poster proposals/ of 500-750
    words. National project reports need not be anonymised.
Submissions are accepted through the EasyChair submission system, 
at https://easychair.org/conferences/?conf=cmlc11.
Please note that each CMLC event produces a volume of proceedings 
(published in Open Access before the meeting), where both oral and 
poster contributions have equal status. /All/ final submissions to the
2023 proceedings volume will be expected to be formatted according to
the ACLPUB guidelines
https://acl-org.github.io/ACLPUB/formatting.html and to pass
the aclpubcheck https://github.com/acl-org/aclpubcheck.
Workshop description
The upcoming CMLC meeting continues the successful series of “Challenges 
in the management of large corpora” events, previously hosted at LREC 
(since 2012) and CL (since 2015) conferences. As in the previous 
meetings, we wish to explore common areas of interest across a range of 
issues in language resource management, corpus linguistics, natural 
language processing, and data science.
Large textual datasets require careful design, collection, cleaning, 
encoding, annotation, storage, retrieval, and curation to be of use for 
a wide range of research questions and to users across a number of 
disciplines. A growing number of national and other very large corpora 
are being made available, many historical archives are being digitised, 
numerous publishing houses are opening their textual assets for text 
mining, and many billions of words can be quickly sourced from the web 
and online social media.
A number of key themes and questions emerge of interest to the 
contributing research communities: (a) what can be done to deal with IPR 
and data protection issues? (b) what sampling techniques can we apply? 
(c) what quality issues should we be aware of? (d) what infrastructures 
and frameworks are being developed for the efficient storage, 
annotation, analysis and retrieval of large datasets? (e) what 
affordances do visualisation techniques offer for the exploratory 
analysis approaches of corpora? (f) what kinds of APIs or other means of 
access would make the corpus data as widely usable as possible without 
interfering with legal restrictions? (g) how to guarantee that corpus 
data remain available and usable in a sustainable way?
Motivation and topics of interest
This year’s event will cover the entire range of the standard CMLC 
themes, with some new additions:
  * New and hot topics
      o Language Models
          + What linguistic insights can we gain by post-hoc language
            model analysis in the age of ChatGPT?
          + How can we avoid the proliferation of stereotypes in terms
            of both linguistic surface form and content when using
            language models for linguistic analysis?
      o Societal and legal issues relevant for corpora and studies
          + political and sociological balance
          + social media bubbles, hate speech and fake news
          + proliferation of stereotypes via corpora and language models
          + corpora as archives of the past: evolution in mentalities or
            laws, personality rights
      o How to make corpora as accessible as possible despite big data
        issues, application heterogeneity, and IPR issues
          + What are the most interesting APIs and libraries to build,
            analyse and access very large corpora?
          + How can we get us researchers to use existing research
            tools, infrastructures, libraries and APIs in research and
            teaching?
  * Linguistic content challenges
      o Dealing with the variety of language resources: multilinguality,
        historical texts, noisy OCR texts, user-generated content, etc.
      o Integration of human computation (crowdsourcing) and automatic
        annotation
      o Quality management of annotations
  * Technical challenges
      o Storage and retrieval solutions for big textual data corpora:
        primary data, metadata, and annotation data
      o Scalable and efficient NLP tooling for annotating and analysing
        large datasets: distributed and GPGPU computing; using big data
        analysis frameworks for language processing
      o Dealing with streaming (e.g. Social Media) and rapidly changing
        underlying data
  * Exploitation challenges
      o Legal and privacy issues
      o Query languages, data models, and standardisation
      o Licensing models of open and closed data, coping with
        intellectual property restrictions
      o Innovative approaches for aggregation and visualisation of text
        analytics
In the tradition of CMLC, we invite reports on national corpus 
initiatives; submitters of these reports should be prepared to present a 
poster along with a short presentation.
Programme Committee
Names are being added as Programme Committee members confirm their 
participation.
  * Laurence Anthony (Waseda University, Japan)
  * Vladimír Benko (Slovak Academy of Sciences)
  * Tomaž Erjavec (Jožef Stefan Institute, Ljubljana)
  * Stephanie Evert (Friedrich-Alexander-Universität Erlangen-Nürnberg)
  * Johannes Graën (University of Zurich, Switzerland)
  * Andrew Hardie (Lancaster University, UK)
  * Serge Heiden (ENS de Lyon)
  * Dawn Knight (Cardiff University)
  * Michal Křen (Charles University, Prague)
  * Martin Reynaert (Tilburg University)
  * Kevin Scannell (Saint-Louis University)
Organising Committee
Institut für Deutsche Sprache, Mannheim
Piotr Bański, Marc Kupietz, Harald Lüngen
Berlin-Brandenburg Academy of Sciences
Adrien Barbaresi
Institute of Computational Linguistics, University of Zurich
Simon Clematide
Homepage
The CMLC series homepage is located at http://corpora.ids-mannheim.de/cmlc.html
