Dear Juri,
Thanks for your reply.
Would you mind addressing my questions, at least the ones under [i], more explicitly? That is: Has anyone claimed that input data does not affect results in any way? Will there be any explicit evaluation involving data statistics (in combination with textual representations)? Will the data used to train the LLMs be included in the evaluation? Where are the data statistics reported? Are there any LLM evaluation reports stating that such statistics or data do not matter in prompting?
There are certainly more opportunities for *more rigorous and detailed* statistical evaluations in the "language and computing" space, especially through use-inspired basic experiments with more carefully controlled parameters. Can the LLMs be reproduced? Have they been trained with a full vocabulary? These seem like good questions to start with for our current LLM evaluation endeavors. There are also methods other than LLMs for evaluation --- if using LLMs is "better", that deserves a more fine-grained, explicit evaluation (i.e. not just a "human vs. automatic"-type narrative).
On your shared task description: "This year’s Eval4NLP shared task, combines these two aspects. We provide a selection of open-source, pre-trained LLMs. *The task is to develop strategies to extract scores from these LLM’s that grade machine translations and summaries. We will specifically focus on prompting techniques*, therefore, fine-tuning of the LLM’s is not allowed." [1] Would you please let me know how this works statistically? What statistical studies/publications underlie the assumption that it is sufficient to evaluate through textual prompts? (I understand there is "[a]n exemplary selection of recent, related, prompt-based metrics" listed on your site, and, as with most CL/NLP papers, there are plenty of numerical end results reported in them --- but is this practice sufficient, even if one might have inadvertently "normalized" it in the past?) And what statistical relations are there between the statistical workings/mechanics of LLMs and the data-statistical profiles of the translations/summaries? It seems that evaluating on textual representations alone would be far too coarse-grained (and insufficient) an evaluation. Would you not agree?
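To make concrete what I mean by evaluating "through textual prompts", here is a minimal sketch of how I picture such score extraction (the model "gpt2", the prompt wording, and the regex are my own illustrative assumptions, not taken from your task setup):

    import re
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative stand-in model; NOT one of the shared-task LLMs.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def prompt_score(source, translation):
        # The textual prompt is the only interface to the model here;
        # no statistics about the inputs enter the evaluation.
        prompt = (f"Source: {source}\n"
                  f"Translation: {translation}\n"
                  "Rate the translation quality from 0 to 100. Score:")
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=5,
                                 pad_token_id=tokenizer.eos_token_id)
        continuation = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True)
        match = re.search(r"\d+(\.\d+)?", continuation)
        return float(match.group()) if match else None

My point is that, in such a setup, whatever statistical properties the translations/summaries (and the models' training data) have are invisible to the evaluation unless they are measured and reported separately.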
Please understand, also, that LLMs have now been incorporated into some educational curricula, for better or worse. There is a responsibility to report transparently, honestly, and comprehensively (i.e. not just to show "textual eye candy" and then hide things behind an onslaught of numerical end results). It is important not to create or reinforce unnecessary hype.
Btw, what does "sentence" refer to in the following? [1]
- "They are mostly based on black-box language models and usually return a single quality score (sentence-level), making it difficult to explain their internal decision process and their outputs."
- "The task will consist of building a reference-free metric for machine translation and/or summarization that predicts sentence-level quality scores constructed from fine-grained scores or error labels."
- "The sentences that should be evaluated are wrapped into a prompt that is passed to a language model as an input."
Is it a line as delimited by line breaks? Or are you segmenting by "sentences", and if so, how? (Remember that "all grammar leaks", and fairness matters in multilinguality!) A small illustration of why the segmentation choice matters follows below.
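To illustrate with a toy example of my own (using NLTK's Punkt splitter as one possible segmenter, not necessarily yours), the same text yields different "sentence-level" units depending on the choice:

    import nltk
    nltk.download("punkt", quiet=True)      # Punkt sentence models
    nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK versions

    text = "Hello world. This is fine.\nAnother line here. And more."

    # Segmenting by line breaks gives two units:
    by_linebreak = text.split("\n")
    # ['Hello world. This is fine.', 'Another line here. And more.']

    # Segmenting with a sentence splitter gives four units:
    by_splitter = nltk.sent_tokenize(text)
    # ['Hello world.', 'This is fine.', 'Another line here.', 'And more.']

The choice changes what a "sentence-level" score is attached to, and sentence splitters behave differently across languages and genres, which is exactly where fairness in multilinguality comes in.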
[1] from https://eval4nlp.github.io/2023/shared-task.html
Best,
Ada
On Tue, Aug 15, 2023 at 3:58 PM Juri Opitz via Corpora <corpora@list.elra.info> wrote:
Dear Ada,
thank you for the interest in our workshop topic. First, directly on the concrete question about our shared task:
will data used to train the LLMs be included in the evaluation?
We try to mitigate this issue by relying on very recently created data from sources of 2023. Maybe this also helps you for the other questions on the shared task.
Your following questions, or, if I may, thoughts, are on the general topics that our workshop is about. I personally think these contain valid points and interesting reflections on evaluation and evaluation practices in NLP; some of them, I believe, touch on deeper issues quite prevalent in NLP and ML, such as potential selection biases in human judges. Generally speaking, Eval4NLP also warmly welcomes contributions that discuss, introduce and extend such or similar thoughts/opinions, critiques, reflections, outlooks, etc. on evaluation and evaluation practices in NLP, wishing to be a forum for the exchange of ideas and feedback.
Best wishes
Juri