Dear colleagues,
the 4th Workshop on Evaluation and Comparison for NLP systems (Eval4NLP), co-located with the 2023 Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (AACL 2023), invites the submission of long and short papers, of a theoretical or experimental nature, describing recent advances in system evaluation and comparison in NLP.
**New**: This year's edition of the Eval4NLP workshop puts a focus on the evaluation of and through large language models (LLMs). Notably, the workshop will feature a shared task on LLM evaluation and specifically encourages the submission of papers focused on LLM evaluation. Other submissions that fit the general scope of Eval4NLP are of course also welcome. See below for more details.
** Important Dates **
All deadlines are 11:59 pm UTC-12h ("Anywhere on Earth").
- Direct submission to Eval4NLP deadline: August 25
- Submission of pre-reviewed papers to Eval4NLP (see below for details): September 25
- Notification of acceptance: October 2
- Camera-ready papers due: October 10
- Workshop day: November 1
Please see the Call for Papers for more details: https://eval4nlp.github.io/2023/cfp.html.
** Shared Task **
This year's edition will come with a shared task on explainable evaluation of generated language (MT and summarization), with a focus on LLM prompts. Please find more information on the shared task page: https://eval4nlp.github.io/2023/shared-task.html.
** Topics **
Designing evaluation metrics:
- Proposing and/or analyzing metrics with desirable properties, e.g., high correlation with human judgments (see the sketch after this list), the ability to distinguish high-quality outputs from mediocre and low-quality ones, robustness across input and output lengths, and efficiency;
- Reference-free evaluation metrics, which only require source text(s) and system predictions;
- Cross-domain metrics, which can reliably and robustly measure the quality of system outputs across heterogeneous modalities (e.g., image and speech), different genres (e.g., newspapers, Wikipedia articles, and scientific papers), and different languages;
- Cost-effective methods for eliciting high-quality manual annotations; and
- Methods and metrics for evaluating the interpretability and explanations of NLP models.
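As an illustration of the first point, here is a minimal sketch (not part of the official call; the score lists are invented placeholders) of how segment-level correlation between a candidate metric and human judgments is commonly computed:

# Minimal sketch: segment-level agreement between a candidate metric and human judgments.
# The two score lists are invented placeholders, not real annotations.
from scipy.stats import pearsonr, kendalltau

human_scores  = [0.90, 0.20, 0.70, 0.40, 0.80]   # e.g., normalized direct-assessment ratings
metric_scores = [0.85, 0.30, 0.60, 0.35, 0.75]   # scores produced by the candidate metric

pearson, _ = pearsonr(human_scores, metric_scores)    # linear association
kendall, _ = kendalltau(human_scores, metric_scores)  # rank agreement, less scale-sensitive
print(f"Pearson r = {pearson:.3f}, Kendall tau = {kendall:.3f}")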
Creating adequate evaluation data:
- Proposing new datasets or analyzing existing ones by studying their coverage and diversity, e.g., size of the corpus, covered phenomena, representativeness of samples, distribution of sample types, and variability among data sources, eras, and genres; and
- Quality of annotations, e.g., consistency of annotations, inter-rater agreement, and bias checks.
Reporting correct results:
- Ensuring and reporting statistics for the trustworthiness of results, e.g., via appropriate significance tests and by reporting score distributions rather than single-point estimates, to avoid chance findings (see the sketch after this list);
- Reproducibility of experiments, e.g., quantifying the reproducibility of papers and issuing reproducibility guidelines; and
- Comprehensive and unbiased error analyses and case studies, avoiding cherry-picking and sampling bias.
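To make the first of these points concrete, here is a minimal sketch (illustrative only; all numbers are synthetic) of a paired bootstrap test for the claim "metric A correlates better with human judgments than metric B", reporting a distribution of differences rather than a single-point estimate:

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
human    = rng.random(200)                     # synthetic stand-in for human ratings
metric_a = human + rng.normal(0.0, 0.20, 200)  # synthetic scores of metric A
metric_b = human + rng.normal(0.0, 0.35, 200)  # synthetic scores of metric B

n, deltas = len(human), []
for _ in range(1000):                          # resample segments with replacement
    idx = rng.integers(0, n, n)
    tau_a, _ = kendalltau(human[idx], metric_a[idx])
    tau_b, _ = kendalltau(human[idx], metric_b[idx])
    deltas.append(tau_a - tau_b)

deltas = np.asarray(deltas)
lo, hi = np.percentile(deltas, [2.5, 97.5])
p = float(np.mean(deltas <= 0))                # share of resamples where A is not better
print(f"mean delta tau = {deltas.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}], bootstrap p = {p:.3f}")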
** Submission Guidelines **
The workshop welcomes two types of submission -- long and short papers. Long papers may consist of up to 8 pages of content, plus unlimited pages of references. Short papers may consist of up to 4 pages of content, plus unlimited pages of references. Please follow the ACL ARR formatting requirements, using the official templates: https://github.com/acl-org/acl-style-files. Final versions of both submission types will be given one additional page of content for addressing reviewers’ comments. The accepted papers will appear in the workshop proceedings. The review process is double-blind. Therefore, no author information should be included in the papers and the (optional) supplementary materials. Self-references that reveal the author's identity must be avoided. Papers that do not conform to these requirements will be rejected without review.
** The submission sites on OpenReview **
Standard submissions: https://openreview.net/group?id=aclweb.org/AACL-IJCNLP/2023/Workshop/Eval4NL...
Pre-reviewed submissions: https://openreview.net/group?id=aclweb.org/AACL-IJCNLP/2023/Workshop/Eval4NL...
See below for more information on the two submission modes.
** Two submission modes: standard and pre-reviewed **
Eval4NLP features two modes of submission.
- Standard submissions: We invite papers that will receive up to three double-blind reviews from the Eval4NLP committee and a final verdict from the workshop chairs.
- Pre-reviewed submissions: By a later deadline, we invite unpublished papers that have already been reviewed, either through ACL ARR or at recent AACL/EACL/ACL/EMNLP/COLING venues. These papers will not receive new reviews but will be judged together with their existing reviews via a meta-review; authors are invited to attach a note commenting on the reviews and describing possible revisions.
Final verdicts will be either accept, reject, or conditional accept, i.e., the paper is only accepted provided that specific (meta-)reviewer requirements have been met. Please also note the multiple submission policy.
** Optional Supplementary Materials **
Authors are allowed to submit (optional) supplementary materials (e.g., appendices, software, and data) to improve the reproducibility of results and/or to provide additional information that does not fit in the paper. All of the supplementary materials must be zipped into one single file (.tgz or .zip) and submitted via OpenReview together with the paper. However, because supplementary materials are completely optional, reviewers may or may not review or even download them. So, the submitted paper should be fully self-contained.
** Preprints **
Papers uploaded to preprint servers (e.g., ArXiv) can be submitted to the workshop. There is no deadline concerning when the papers were made publicly available. However, the version submitted to Eval4NLP must be anonymized, and we ask the authors not to update the preprints or advertise them on social media while they are under review at Eval4NLP.
** Multiple Submission Policy **
Eval4NLP allows authors to submit a paper that is under review at another venue (journal, conference, or workshop) or that will be submitted elsewhere during the Eval4NLP review period. However, authors need to withdraw the paper from all other venues if it is accepted and they want to publish it at Eval4NLP. Note that AACL and ARR do not allow double submissions. Hence, papers submitted both to the main conference and to AACL workshops (including Eval4NLP) violate the multiple submission policy of the main conference. If authors would like to submit a paper that is under review at AACL to the Eval4NLP workshop, they need to withdraw it from AACL and submit it to our workshop before the workshop submission deadline.
** Best Paper Awards **
We will optionally award prizes to the best paper submissions (subject to availability; more details to come soon). Both long and short submissions will be eligible for prizes.
** Presenting Published Papers **
If you want to present a paper that has recently been published elsewhere (e.g., at other top-tier AI conferences) at our workshop, you may send the details of your paper (title, authors, publication venue, abstract, and a link to download the paper) directly to eval4nlp@gmail.com. We will select a few high-quality, relevant papers to present at Eval4NLP. This allows such papers to gain more visibility with the workshop audience and increases the variety of the workshop program. Note that the chosen papers are considered non-archival and will not be included in the workshop proceedings.
-------------------------------------------------
Best wishes,
Eval4NLP organizers
Website: https://eval4nlp.github.io/2023/index.html
Email: eval4nlp@gmail.com
Dear Eval4NLP organizers,
Thank you for your continued efforts in evaluating NLP systems.
The following formulations/initiatives have come to my attention. I have some questions:
i. "** Shared Task ** This year’s version will come with a shared task on explainable evaluation of generated language (MT and summarization) with a focus on LLM prompts.": will data used to train the LLMs be included in the evaluation? Has anyone claimed that input data does not affect results in any way? Will there be any explicit evaluation with data statistics (in combination with textual representations)?
ii. "Designing evaluation metrics: Proposing and/or analyzing metrics with desirable properties, e.g., high correlations with human judgments,": which "humans" / who will be selected to judge? Is there selection bias in the participants chosen for evaluation? Is this reflected in the results? (If human testing were involved, have ethical considerations been addressed?)
iii. "Reference-free evaluation metrics": I understand there might have been a practice somewhat "normalized" in the CL/NLP space --- that human evaluations are always/usually at odds with automatic ones. Could it be that one has been evaluating only based on textual surface values, irrespective of data statistics? While there is nothing wrong with pointing out potential human biases, I think no evaluation for NLP systems can be complete if one does not also evaluate the data used and all relevant statistics.
iv. Re "Cross-domain metrics, which can reliably and robustly measure the quality of system outputs from heterogeneous modalities (e.g., image and speech), different genres (e.g., newspapers, Wikipedia articles and scientific papers) and different languages;": to what extent do domains, genres, and languages matter in the context of computing if there are no differences in data statistics whatsoever [1]? How about an explicit statistical eval?
ML and NLP systems are used ubiquitously nowadays. It is our responsibility as technologists to act responsibly. Our work affects various sectors as well as the general public --- not to mention it is a way through which technology/computing-at-large can be evaluated.
I look forward to your responses.
Best,
Ada

[1] This could be an interesting research question indeed! One that I was going to work on if my work/life hadn't been interrupted/"disrupted" by some scientific misconduct, including but not limited to the denial of research results, lack of transparency in evaluation, and potential collusion in research and/or publication misconduct.
Dear Ada,
thank you for your interest in our workshop topic. First, directly on the concrete question about our shared task:
"will data used to train the LLMs be included in the evaluation?"
We try to mitigate this issue by relying on very recently created data from 2023 sources. Perhaps this also helps with your other questions about the shared task.
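If it helps, here is a rough sketch of one way such a check could be made explicit with simple data statistics; the file names, the n-gram length, and the threshold are placeholders for illustration only, not something prescribed by the shared task:

# Rough sketch: flag evaluation segments whose character n-grams overlap heavily with a
# sample of text that could plausibly appear in an LLM's pretraining data.
def char_ngrams(text, n=13):
    text = " ".join(text.lower().split())  # whitespace-normalize and lowercase
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def overlap_ratio(segment, reference_ngrams, n=13):
    grams = char_ngrams(segment, n)
    return len(grams & reference_ngrams) / max(len(grams), 1)

# Placeholder file names: evaluation segments (one per line) and a pretraining-era text sample.
with open("eval_segments.txt", encoding="utf-8") as f:
    segments = [line.strip() for line in f if line.strip()]
with open("pretraining_sample.txt", encoding="utf-8") as f:
    reference = char_ngrams(f.read())

for seg in segments:
    ratio = overlap_ratio(seg, reference)
    if ratio > 0.5:  # arbitrary threshold, for illustration only
        print(f"possible overlap ({ratio:.2f}): {seg[:60]}...")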
Your following questions, or, if I may, thoughts, concern the general topics that our workshop is about. I personally think they contain valid points and interesting reflections on evaluation and evaluation practices in NLP; some of them, I believe, touch on deeper issues quite prevalent in NLP and ML, such as potential selection biases among human judges. Generally speaking, Eval4NLP also warmly welcomes contributions that discuss, introduce, and extend such or similar thoughts, opinions, critiques, reflections, and outlooks on evaluation and evaluation practices in NLP; we wish to be a forum for the exchange of ideas and feedback.
Best wishes
Juri
Dear Juri,
Thanks for your reply.
Would you mind addressing my questions, at least those under [i], more explicitly? That is: Has anyone claimed that input data does not affect results in any way? Will there be any explicit evaluation with data statistics (in combination with textual representations)? Will the data used to train the LLMs be included in the evaluation? Where are the data statistics reported? Are there any LLM evaluation reports stating that such statistics or data do not matter in prompting?
There are certainly more opportunities for *more rigorous and detailed* statistical evaluations in the "language and computing" space, especially through use-inspired basic experiments with more carefully controlled parameters. Can the LLMs be reproduced? Have they been trained with a full vocabulary? These seem to be good questions to start with for our current LLM evaluation endeavors. There are also methods other than using LLMs for evaluation --- if using LLMs is "better", this deserves a more fine-grained, explicit evaluation (i.e., not just a "human vs. automatic" type of narrative).
On your shared task description: "This year’s Eval4NLP shared task, combines these two aspects. We provide a selection of open-source, pre-trained LLMs. *The task is to develop strategies to extract scores from these LLM’s that grade machine translations and summaries. We will specifically focus on prompting techniques*, therefore, fine-tuning of the LLM’s is not allowed." [1] Would you please let me know how this works statistically? What statistical studies/publications are there to underpin the assumption that it is sufficient to evaluate through textual prompts? (I understand there is "[a]n exemplary selection of recent, related, prompt-based metrics" listed on your site, and, as with most CL/NLP papers, there are tons of numerical end results reported in them --- but is this practice sufficient? Even if one might have inadvertently "normalized" the practice in the past?) And what statistical relations are there between the statistical workings/mechanics of LLMs and the data-statistical profiles of translations/summaries? It seems that if one were to evaluate based only on textual representations, the evaluation would be all too coarse-grained (and insufficient). Would you not agree?
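For concreteness, the kind of prompt-based score extraction I understand the task description to mean is sketched below; this is only my bare-bones reading, with a tiny stand-in model, a made-up prompt, and regex-based parsing as placeholder assumptions rather than the organizers' baseline:

import re
from transformers import pipeline

# "gpt2" is only a tiny stand-in so the sketch runs; the shared task lists specific,
# larger open-source LLMs.
generator = pipeline("text-generation", model="gpt2")

def prompt_score(source, translation):
    # Wrap the segment pair in a grading prompt and parse a numeric score from the output.
    prompt = (
        "Rate the quality of the following translation on a scale from 0 (useless) "
        "to 100 (perfect). Answer with a single number.\n"
        f"Source: {source}\nTranslation: {translation}\nScore:"
    )
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    match = re.search(r"Score:\s*(\d+)", out)  # first number generated after "Score:"
    return float(match.group(1)) if match else None

print(prompt_score("Der Hund schläft.", "The dog is sleeping."))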
Please also understand that LLMs have now been incorporated into some educational curricula, for better or for worse. There is a responsibility to report transparently, honestly, and comprehensively (i.e., not just showcasing "textual eye candy" while hiding things behind an onslaught of numerical end results). It is important not to create or reinforce unnecessary hype.
By the way, what does "sentence" refer to in the following passages [1]?
- "They are mostly based on black-box language models and usually return a single quality score (sentence-level), making it difficult to explain their internal decision process and their outputs."
- "The task will consist of building a reference-free metric for machine translation and/or summarization that predicts sentence-level quality scores constructed from fine-grained scores or error labels."
- "The sentences that should be evaluated are wrapped into a prompt that is passed to a language model as an input."
Lines as delimited by line breaks? Or are you segmenting by "sentences"? (Remember that "all grammars leak" and fairness matters in multilinguality!)
[1] from https://eval4nlp.github.io/2023/shared-task.html
Best,
Ada