Dear Friends and Colleagues,
Whether you are a champion or a skeptic of LLMs, they are here to stay. How the story ends is still unknown. We believe a critical part of the plot will be the progress we make in evaluating LLMs, in both the training and testing (inference) phases. We currently lack appropriate, shared, and transparent evaluation methodologies and metrics.
We would like to invite you to review and contribute to the proposed open model for LLM evaluation. It is a framework you can use as-is, adopt in part, or help modify and extend.
Article: Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable? In Proceedings of the 2nd GEM Workshop @ EMNLP 2022. https://aclanthology.org/2022.gem-1.12/
Repo (code, UI, guidelines, etc.): https://github.com/sislab-unitn/Human-Evaluation-Protocol
Publications utilizing the proposed protocol:
1. Response Generation in Longitudinal Dialogues: Which Knowledge Representation Helps? (Mousavi et al., NLP4ConvAI 2023) https://aclanthology.org/2023.nlp4convai-1.1/
2. Are LLMs Robust for Spoken Dialogues? (Mousavi et al., IWSDS 2024) https://arxiv.org/abs/2401.02297
Best regards,
----
Prof. Dr.-Ing. Giuseppe Riccardi
Founder and Director of the Signals and Interactive Systems Lab
Department of Computer Science and Engineering
University of Trento