We're happy to announce the release of *MessIRve*, a new *large-scale IR dataset in Spanish!*
MessIRve* contains around *730k queries from 20 Spanish-speaking countries and the United States*, with relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets.
The dataset is available on *HuggingFace*! 🤗
- Queries and relevance judgments: spanish-ir/messirve https://huggingface.co/datasets/spanish-ir/messirve
- The collection of documents: spanish-ir/eswiki_20240401_corpus https://huggingface.co/datasets/spanish-ir/eswiki_20240401_corpus
- Queries and qrels in TREC format: spanish-ir/messirve-trec https://huggingface.co/datasets/spanish-ir/messirve-trec
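
For readers who have not worked with TREC-format qrels before, here is a minimal sketch of how such a file is usually read. It assumes the conventional four-column layout (`query_id iteration doc_id relevance`); we have not checked this against the actual files in spanish-ir/messirve-trec. The helper name `parse_trec_qrels` and the sample IDs are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def parse_trec_qrels(lines):
    """Parse TREC-style qrels lines: `query_id iteration doc_id relevance`.

    Hypothetical helper; assumes the conventional four-column layout.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up sample data, NOT real MessIRve judgments.
sample = """\
q1 0 doc10 1
q1 0 doc11 0
q2 0 doc20 1
"""

qrels = parse_trec_qrels(sample.splitlines())

# Keep only the documents judged relevant (relevance > 0) for each query.
relevant = {q: [d for d, r in docs.items() if r > 0] for q, docs in qrels.items()}
print(relevant)
```

To use it on a downloaded qrels file, pass an open file handle instead of `sample.splitlines()`.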
For more details, check out our *arXiv paper*: MessIRve: A Large-Scale Spanish Information Retrieval Dataset http://arxiv.org/abs/2409.05994
We hope MessIRve serves to spur more work in IR for the Spanish language and facilitate the development of efficient information access tools for Spanish speakers.
* MessIRve means *works for me* in Spanish ("me sirve"). The reference to Lionel Messi, who plays football, the most popular sport in Spanish-speaking countries, stresses the importance of using topics that are relevant to Spanish speakers.