The Fourth Workshop on Human Evaluation of NLP Systems (HumEval 2024) invites the submission of long and short papers on current human evaluation research and future directions. HumEval 2024 will take place in Turin, Italy, on 21 May 2024, during LREC-COLING 2024.
Website: https://humeval.github.io/
Important dates:
- Submission deadline: 11 March 2024
- Paper acceptance notification: 4 April 2024
- Camera-ready versions: 19 April 2024
- HumEval 2024: 21 May 2024
- LREC-COLING 2024 conference: 20–25 May 2024
All deadlines are 23:59 UTC-12.
===============================================
Human evaluation plays a central role in NLP, from the large-scale crowd-sourced evaluations carried out e.g. by the WMT workshops, to the much smaller experiments routinely encountered in conference papers. Moreover, while NLP has embraced a number of automatic evaluation metrics, the field has always been acutely aware of their limitations (Callison-Burch et al., 2006; Reiter and Belz, 2009; Novikova et al., 2017; Reiter, 2018; Mathur et al., 2020a), and has gauged their trustworthiness in terms of how well, and how consistently, they correlate with human evaluation scores (Gatt and Belz, 2008; Popović and Ney, 2011; Shimorina, 2018; Mille et al., 2019; Dušek et al., 2020; Mathur et al., 2020b).
Yet there is growing unease about how human evaluations are conducted in NLP. Researchers have pointed out the less than perfect experimental and reporting standards that prevail (van der Lee et al., 2019; Gehrmann et al., 2023), and that low-quality evaluations with crowdworkers may not correlate well with high-quality evaluations with domain experts (Freitag et al., 2021). Only a small proportion of papers provide enough detail for reproduction of human evaluations, and in many cases the information provided is not even enough to support the conclusions drawn (Belz et al., 2023). We have found that more than 200 different quality criteria (such as Fluency, Accuracy, Readability, etc.) have been used in NLP, and that different papers use the same quality criterion name with different definitions, and the same definition with different names (Howcroft et al., 2020). Furthermore, many papers do not use a named criterion, asking the evaluators only to assess 'how good' the output is. Inter- and intra-annotator agreement are usually reported only as a single overall number, without analysing the reasons for disagreement or the potential to reduce it. A small number of papers have aimed to address this from different perspectives, e.g. comparing agreement for different evaluation methods (Belz and Kow, 2010), or analysing errors and linguistic phenomena related to disagreement (Pavlick and Kwiatkowski, 2019; Oortwijn et al., 2021; Thomson and Reiter, 2020; Popović, 2021). The need for context beyond the sentence for reliable evaluation has also started to be investigated (e.g. Castilho et al., 2020).
The above aspects all interact in different ways with the reliability and reproducibility of human evaluation measures. While reproducibility of automatically computed evaluation measures has attracted attention for a number of years (e.g. Pineau et al., 2018; Branco et al., 2020), research on reproducibility of measures involving human evaluations is a more recent addition (Cooper and Shardlow, 2020; Belz et al., 2023).
The HumEval workshops (previously at EACL 2021, ACL 2022, and RANLP 2023) aim to create a forum for current human evaluation research and future directions: a space for researchers working with human evaluations to exchange ideas and begin to address the issues that human evaluation in NLP faces in many respects, including experimental design, meta-evaluation and reproducibility. We invite papers on topics including, but not limited to, the following, as addressed in any subfield of NLP:
- Experimental design and methods for human evaluations
- Reproducibility of human evaluations
- Inter-evaluator and intra-evaluator agreement
- Ethical considerations in human evaluation of computational systems
- Quality assurance for human evaluation
- Crowdsourcing for human evaluation
- Issues in meta-evaluation of automatic metrics by correlation with human evaluations
- Alternative forms of meta-evaluation and validation of human evaluations
- Comparability of different human evaluations
- Methods for assessing the quality and the reliability of human evaluations
- Role of human evaluation in the context of Responsible and Accountable AI
Submissions for both short and long papers will be made directly via START, following submission guidelines issued by LREC-COLING 2024. For full submission details please refer to the workshop website.
The third ReproNLP Shared Task on Reproduction of Automatic and Human Evaluations of NLP Systems will be part of HumEval, offering (A) an Open Track for any reproduction studies involving human evaluation of NLP systems; and (B) the ReproHum Track where participants will reproduce the papers currently being reproduced by partner labs in the EPSRC ReproHum project. A separate call will be issued for ReproNLP 2024.