The Third Workshop on LLM4Eval continues the discussion begun in the previous editions of the series, which investigated the potential and challenges of using LLMs for search relevance evaluation, automated judgments, and retrieval-augmented generation (RAG) assessment. As modern IR systems integrate search, recommendations, conversational interfaces, and personalization, new evaluation challenges arise that go beyond basic relevance assessment. These applications produce personalized rankings and explanations and adapt to user preferences over time, requiring new evaluation methods.

Overview

Recent advances in Large Language Models (LLMs) have significantly impacted evaluation methodologies in Information Retrieval (IR), reshaping how relevance, quality, and user satisfaction are assessed. Having initially demonstrated their potential for query-document relevance judgments, LLMs are now being applied to more complex tasks, including relevance label generation, assessment of retrieval-augmented generation (RAG) systems, and evaluation of the quality of text-generation systems. As IR systems evolve toward more sophisticated and personalized user experiences that integrate search, recommendations, and conversational interfaces, new evaluation methodologies become necessary.
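To make the kind of LLM-based query-document relevance judgment discussed above concrete, the following is a minimal illustrative sketch, not a method endorsed by the workshop or drawn from any submission. It assumes the OpenAI Python SDK, an illustrative model name (gpt-4o-mini), and a simple 0-3 graded relevance scale; the prompt wording and parsing are likewise assumptions.

```python
# Illustrative sketch of an LLM-as-judge relevance labeler.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# model name, prompt, and 0-3 scale are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()

PROMPT = """Given a query and a document passage, assign a relevance grade:
0 = irrelevant, 1 = marginally relevant, 2 = relevant, 3 = perfectly relevant.
Reply with the grade only.

Query: {query}
Passage: {passage}
Grade:"""

def judge_relevance(query: str, passage: str, model: str = "gpt-4o-mini") -> int:
    """Ask the LLM for a graded relevance label and parse it as an integer."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,  # keep labels as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())

if __name__ == "__main__":
    print(judge_relevance("best way to brew espresso",
                          "A guide to pulling espresso shots on a manual lever machine."))
```

Sketches like this raise exactly the questions the workshop targets, such as how well such labels agree with human assessors and how sensitive they are to prompt and model choices.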

Building upon the success of our previous workshops, this third iteration of the LLM4Eval workshop at SIGIR 2025 seeks submissions exploring new opportunities, limitations, and hybrid approaches involving LLM-based evaluations.

Important Dates

Topics of interest

We invite submissions on topics including, but not limited to:

Submission guidelines

Publication options

Authors can choose between archival and non-archival options for their submissions: