Dear Juri,
Thanks for your reply.
Would you mind addressing my questions, at least the ones under [i], more explicitly? That is: Has anyone claimed that input data does not affect results in any way? Will there be any explicit evaluation involving data statistics (in combination with textual representations)? Will the data used to train the LLMs be included in the evaluation? Where are the data statistics reported? Are there any LLM evaluation reports stating that such statistics or data do not matter in prompting?
There are certainly more opportunities for *more rigorous and detailed* statistical evaluations in the "language and computing" space, especially through use-inspired basic experiments with more carefully controlled parameters. Can the LLMs be reproduced? Have they been trained with a full vocabulary? These seem like good questions to start with for our current LLM evaluation endeavors. There are also methods other than LLMs for evaluation --- if using LLMs is "better", that deserves a more fine-grained, explicit evaluation (i.e. not just a "human vs. automatic"-type narrative).
On your shared task description: "This year’s Eval4NLP shared task, combines these two aspects. We provide a selection of open-source, pre-trained LLMs. *The task is to develop strategies to extract scores from these LLM’s that grade machine translations and summaries. We will specifically focus on prompting techniques*, therefore, fine-tuning of the LLM’s is not allowed." [1] Would you please let me know how this works statistically? What statistical studies/publications underlie the assumption that it is sufficient to evaluate through textual prompts? (I understand there is "[a]n exemplary selection of recent, related, prompt-based metrics" listed on your site, and, as with most CL/NLP papers, there are plenty of numerical end results reported in them --- but is this practice sufficient, even if one might have inadvertently "normalized" it in the past?) And what statistical relations are there between the statistical workings/mechanics of LLMs and the data-statistical profiles of the translations/summaries? It seems that evaluating on textual representations alone would be far too coarse-grained (and insufficient) an evaluation. Would you not agree?
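To make concrete what I mean by evaluating "through textual prompts", here is a minimal sketch of how I picture such score extraction (the model "gpt2", the prompt wording, and the regex are my own illustrative assumptions, not taken from your task setup):

    import re
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative stand-in model; NOT one of the shared-task LLMs.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def prompt_score(source, translation):
        # The textual prompt is the only interface to the model here;
        # no statistics about the inputs enter the evaluation.
        prompt = (f"Source: {source}\n"
                  f"Translation: {translation}\n"
                  "Rate the translation quality from 0 to 100. Score:")
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=5,
                                 pad_token_id=tokenizer.eos_token_id)
        continuation = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True)
        match = re.search(r"\d+(\.\d+)?", continuation)
        return float(match.group()) if match else None

My point is that, in such a setup, whatever statistical properties the translations/summaries (and the models' training data) have are invisible to the evaluation unless they are measured and reported separately.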
Please understand, also, that LLMs have now been incorporated into some educational curricula, for better or worse. There is a responsibility to report transparently, honestly, and comprehensively (i.e. not just to show "textual eye candy" and then hide things behind an onslaught of numerical end results). It is important not to create or reinforce unnecessary hype.
Btw, what does "sentence" refer to in the following? [1]
- "They are mostly based on black-box language models and usually return a single quality score (sentence-level), making it difficult to explain their internal decision process and their outputs."
- "The task will consist of building a reference-free metric for machine translation and/or summarization that predicts sentence-level quality scores constructed from fine-grained scores or error labels."
- "The sentences that should be evaluated are wrapped into a prompt that is passed to a language model as an input."
Is it a line as delimited by line breaks? Or are you segmenting by "sentences", and if so, how? (Remember that "all grammar leaks", and fairness matters in multilinguality!) A small illustration of why the segmentation choice matters follows below.
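To illustrate with a toy example of my own (using NLTK's Punkt splitter as one possible segmenter, not necessarily yours), the same text yields different "sentence-level" units depending on the choice:

    import nltk
    nltk.download("punkt", quiet=True)      # Punkt sentence models
    nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK versions

    text = "Hello world. This is fine.\nAnother line here. And more."

    # Segmenting by line breaks gives two units:
    by_linebreak = text.split("\n")
    # ['Hello world. This is fine.', 'Another line here. And more.']

    # Segmenting with a sentence splitter gives four units:
    by_splitter = nltk.sent_tokenize(text)
    # ['Hello world.', 'This is fine.', 'Another line here.', 'And more.']

The choice changes what a "sentence-level" score is attached to, and sentence splitters behave differently across languages and genres, which is exactly where fairness in multilinguality comes in.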
[1] from https://eval4nlp.github.io/2023/shared-task.html
Best,
Ada
On Tue, Aug 15, 2023 at 3:58 PM Juri Opitz via Corpora <corpora@list.elra.info> wrote:
Dear Ada,
thank you for the interest in our workshop topic. First, directly on the concrete question about our shared task:
will data used to train the LLMs be included in the evaluation?
We try to mitigate this issue by relying on very recently created data from sources of 2023. Maybe this also helps you for the other questions on the shared task.
Your following questions, or, if I may, thoughts, are on the general topics that our workshop is about. I personally think these contain valid points and interesting reflections on evaluation and evaluation practices in NLP; some of them, I believe, touch on deeper issues quite prevalent in NLP and ML, such as potential selection biases in human judges. Generally speaking, Eval4NLP also warmly welcomes contributions that discuss, introduce and extend such or similar thoughts/opinions, critiques, reflections, outlooks, etc. on evaluation and evaluation practices in NLP, wishing to be a forum for the exchange of ideas and feedback.
Best wishes
Juri