We're happy to announce the release of MessIRve, a new large-scale IR dataset in Spanish!

MessIRve* contains around 730k queries from 20 Spanish-speaking countries and the United States, with relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets.

The dataset is available on Hugging Face! 🤗

Queries and relevance judgments: spanish-ir/messirve
The collection of documents: spanish-ir/eswiki_20240401_corpus
Queries and qrels in TREC format: spanish-ir/messirve-trec
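
For readers who have not worked with TREC-format relevance judgments before, here is a minimal sketch of how such a file can be read. It assumes the conventional four-column qrels layout (query id, iteration, document id, relevance); the exact file names and layout in spanish-ir/messirve-trec are not described here, so check the repository before relying on this. The function name `parse_qrels` and the sample lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse qrels lines in the conventional TREC layout.

    Assumed layout per line: query_id iteration doc_id relevance
    (whitespace-separated). This is an assumption about the files,
    not something confirmed by the dataset card.
    """
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up sample lines, not taken from MessIRve.
sample = [
    "q1 0 doc10 1",
    "q1 0 doc11 0",
    "q2 0 doc20 1",
]
qrels = parse_qrels(sample)
```

The result maps each query id to its judged documents, which is the shape most evaluation tools expect. To read a real file, pass an open file handle instead of the sample list.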

For more details, check out our arXiv paper: MessIRve: A Large-Scale Spanish Information Retrieval Dataset

We hope MessIRve serves to spur more work in IR for the Spanish language and facilitate the development of efficient information access tools for Spanish speakers.

* MessIRve is a play on "me sirve", Spanish for "it works for me". The nod to Lionel Messi, who plays football, the most popular sport in Spanish-speaking countries, stresses the importance of using topics that are relevant to Spanish speakers.