Dear colleagues,
You are invited to participate in the Eval4NLP 2023 shared task on **Prompting Large Language Models as Explainable Metrics**.
Please find more information below and on the shared task webpage: https://eval4nlp.github.io/2023/shared-task.html
** Important Dates **
- Shared task announcement: August 02, 2023
- Dev phase: August 07, 2023
- Test phase: September 18, 2023
- System submission deadline: September 23, 2023
- System paper submission deadline: October 5, 2023
- System paper camera-ready submission deadline: October 12, 2023
All deadlines are 11:59 pm UTC-12 (“Anywhere on Earth”). The timeframe of the test phase may change; please check the shared task webpage regularly: https://eval4nlp.github.io/2023/shared-task.html.
** Overview **
With groundbreaking innovations in unsupervised learning and scalable architectures, the opportunities (but also risks) of automatically generating audio, images, video, and text seem overwhelming. Human evaluations of this content are costly and often infeasible to collect. Thus, the need for automatic metrics that reliably judge the quality of generation systems and their outputs is stronger than ever. Current state-of-the-art metrics for natural language generation (NLG) still do not match the performance of human experts. They are mostly based on black-box language models and usually return a single sentence-level quality score, making it difficult to explain their internal decision process and their outputs.
The release of APIs to large language models (LLMs) like ChatGPT, together with the recent open-source availability of LLMs like LLaMA, has led to a surge of research in NLP, including LLM-based metrics. Metrics like GEMBA [*] explore prompting ChatGPT and GPT-4 to leverage them directly as metrics. InstructScore [*] goes in a different direction and fine-tunes a LLaMA model to predict a fine-grained error diagnosis of machine-translated content. We notice that current work (1) does not systematically evaluate the vast space of possible prompts and prompting techniques for metric usage, including, for example, approaches that explain a task to a model or let the model explain a task itself, and (2) rarely evaluates the performance of recent open-source LLMs, even though their use, in contrast to closed-source metrics, is essential for improving the reproducibility of metric research.
This year’s Eval4NLP shared task combines these two aspects. We provide a selection of open-source, pre-trained LLMs. The task is to develop strategies to extract scores from these LLMs that grade machine translations and summaries. We specifically focus on prompting techniques; therefore, fine-tuning of the LLMs is not allowed.
Based on the submissions, we hope to explore and formalize prompting approaches for open-source LLM-based metrics and, with that, help improve their correlation with human judgements. As many prompting techniques produce explanations as a side product, we hope that this task will also lead to more explainable metrics. Further, we want to evaluate which of the selected open-source models provide the best capabilities as metrics and could thus serve as a basis for fine-tuning.
** Goals **
The shared task has the following goals:
Prompting strategies for LLM-based metrics: We want to explore which prompting strategies perform best for LLM-based metrics, e.g., few-shot prompting [*], where examples of other solutions are given in the prompt; chain-of-thought (CoT) reasoning [*], where the model is prompted to provide a multi-step explanation itself; or tree-of-thought prompting [*], where different explanation paths are considered and the best one is chosen. Automatic prompt generation [*] might also be considered. Numerous other recent works explore further prompting strategies, some of which use multiple evaluation passes.
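For illustration, a few-shot prompt with an optional CoT instruction could be assembled as in the minimal sketch below. The example translation pairs, the 0–100 scale, and the exact wording are our own assumptions, not a prescribed format or an official baseline.

```python
# Minimal sketch of a few-shot prompt for reference-free MT evaluation with an
# optional chain-of-thought instruction. The example pairs, the 0-100 scale, and
# the wording are illustrative assumptions, not part of the task definition.

FEW_SHOT_EXAMPLES = [
    {"source": "Der Vertrag wurde gestern unterzeichnet.",
     "translation": "The contract was signed yesterday.", "score": 95},
    {"source": "Der Vertrag wurde gestern unterzeichnet.",
     "translation": "The contract signs yesterday was.", "score": 40},
]

def build_prompt(source: str, translation: str, chain_of_thought: bool = False) -> str:
    """Assemble a few-shot prompt; optionally ask for step-by-step reasoning (CoT)."""
    lines = ["Rate how well the translation conveys the source sentence on a scale from 0 to 100."]
    for ex in FEW_SHOT_EXAMPLES:
        lines += [f"Source: {ex['source']}",
                  f"Translation: {ex['translation']}",
                  f"Score: {ex['score']}"]
    lines += [f"Source: {source}", f"Translation: {translation}"]
    if chain_of_thought:
        lines.append("Let's think step by step about the errors before giving the score.")
    lines.append("Score:")
    return "\n".join(lines)

print(build_prompt("Das Wetter ist heute schön.", "The weather is beautiful today."))
```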
Score aggregation for LLM-based metrics: We also want to explore which strategies best aggregate model scores from LLM-based metrics. E.g., scores might be extracted as the probability of a paraphrase being generated [*], or they could be extracted directly from the LLM output [*].
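Two such aggregation routes might look like the following sketch. The generation string and the token log-probabilities are fabricated stand-ins for real LLM outputs, and the parsing logic is only one possible choice.

```python
# Minimal sketch of two score-aggregation strategies. The generation string and
# the token log-probabilities below are fabricated stand-ins for real LLM outputs.
import math
import re

def score_from_text(generation: str):
    """Strategy 1: parse the number the model wrote after 'Score:'."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", generation)
    return float(match.group(1)) if match else None

def score_from_logprobs(logprobs: dict) -> float:
    """Strategy 2: expectation over candidate score tokens, weighted by the
    model's (normalized) probability of emitting each of them."""
    probs = {token: math.exp(lp) for token, lp in logprobs.items()}
    total = sum(probs.values())
    return sum(float(token) * p for token, p in probs.items()) / total

print(score_from_text("The translation is fluent but drops a clause. Score: 78"))
print(score_from_logprobs({"60": -1.2, "70": -0.4, "80": -0.9}))  # fake log-probs
```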
Explainability for LLM-based metrics: We want to analyze whether the metrics that provide the best explanations (for example, with CoT) also achieve the highest correlation with human judgements. We assume that this is the case, because the human judgements are themselves based on fine-grained evaluations (e.g., MQM for machine translation).
** Task Description **
The task consists of building a reference-free metric for machine translation and/or summarization that predicts sentence-level quality scores constructed from fine-grained scores or error labels. Reference-free means that the metric rates the provided machine translation solely based on the provided source sentence/paragraph, without any additional, human-written references. Further, we note that many open-source LLMs have mostly been trained on English data, which adds further challenges to the reference-free setup.
To summarize, the task will be structured as follows:
- We provide a list of allowed LLMs from Huggingface
- Participants should use prompting to employ these LLMs as metrics for MT and summarization
- Fine-tuning of the selected model(s) is not allowed
- We will release baselines, which participants might build upon
- We will provide a CodaLab dashboard to compare participants' solutions to others
We plan to successively release a CodaLab submission environment, baselines, and dev-set evaluation code by August 7.
We will allow specific models from Huggingface; please refer to the webpage for more details: https://eval4nlp.github.io/2023/shared-task.html
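To give an impression of what such a setup could look like end to end, here is a minimal, self-contained sketch that prompts a Huggingface model and parses a sentence-level score. The model name is a placeholder for one of the allowed models listed on the webpage, and the prompt wording, 0–100 scale, and parsing are our own illustrative choices, not an official baseline.

```python
# Minimal sketch of prompting an open-source Huggingface LLM as a reference-free
# MT metric. "allowed-model-name" is a placeholder: pick a model from the allowed
# list on the shared task webpage. Prompt wording, the 0-100 scale, and the score
# parsing are illustrative choices, not an official baseline.
import re
from transformers import pipeline

generator = pipeline("text-generation", model="allowed-model-name")  # placeholder model id

source = "Das Wetter ist heute schön."
translation = "The weather is beautiful today."
prompt = (
    "Rate how well the translation conveys the source sentence on a scale from 0 to 100.\n"
    f"Source: {source}\n"
    f"Translation: {translation}\n"
    "Score:"
)

generation = generator(prompt, max_new_tokens=16, do_sample=False)[0]["generated_text"]
match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", generation)
score = float(match.group(1)) if match else None  # sentence-level score, or None if parsing fails
print(score)
```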
Best wishes,
The Eval4NLP organizers
[*] References are listed on the shared task webpage: https://eval4nlp.github.io/2023/shared-task.html