[Corpora-List] MessIRve: a large-scale Spanish Information Retrieval dataset

11 Sep 2024


We're happy to announce the release of *MessIRve*, a new *large-scale IR
dataset in Spanish*!
MessIRve* contains around *730k queries from 20 Spanish-speaking countries
and the United States*, with relevant documents sourced from Wikipedia.
MessIRve's queries reflect diverse Spanish-speaking regions, unlike other
datasets that are translated from English or do not consider dialectal
variations. The large size of the dataset allows it to cover a wide variety
of topics, unlike smaller datasets.
The dataset is available on *Hugging Face*! 🤗
- Queries and relevance judgments: spanish-ir/messirve
  https://huggingface.co/datasets/spanish-ir/messirve
- The collection of documents: spanish-ir/eswiki_20240401_corpus
  https://huggingface.co/datasets/spanish-ir/eswiki_20240401_corpus
- Queries and qrels in TREC format: spanish-ir/messirve-trec
  https://huggingface.co/datasets/spanish-ir/messirve-trec
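
For anyone who wants to use the TREC-format release, here is a minimal sketch of reading qrels into a nested dictionary. It assumes the standard four-column TREC qrels layout (query ID, iteration, document ID, relevance); the helper name parse_qrels and the sample IDs are made up for illustration and are not taken from the dataset files.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse TREC-style qrels lines into {query_id: {doc_id: relevance}}.

    Assumes each non-empty line has four whitespace-separated fields:
    query_id, iteration (ignored), doc_id, relevance.
    """
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query_id, _iteration, doc_id, relevance = line.split()
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical sample lines, not real MessIRve identifiers.
sample = [
    "q1 0 doc_10 1",
    "q1 0 doc_11 0",
    "q2 0 doc_20 1",
]
qrels = parse_qrels(sample)
```

To read a downloaded file instead, pass an open file handle, since the function accepts any iterable of lines.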
For more details, check out our *arXiv paper*: MessIRve: A Large-Scale
Spanish Information Retrieval Dataset http://arxiv.org/abs/2409.05994
We hope MessIRve serves to spur more work in IR for the Spanish language
and facilitate the development of efficient information access tools for
Spanish speakers.
* MessIRve means *works for me* in Spanish ("me sirve"). The reference
to Lionel Messi, player of the most popular sport in Spanish-speaking
countries, football, stresses the importance of using topics that are
relevant to Spanish speakers.
