Dear corpora-list members,

We are announcing the first SemEval shared task on Semantic Textual Relatedness (STR), which focuses on automatically detecting the degree of semantic relatedness (closeness in meaning) between pairs of sentences.

The semantic relatedness of two language units has long been considered fundamental to understanding meaning (Halliday and Hasan, 1976; Miller and Charles, 1991), and automatically determining relatedness has many applications such as evaluating sentence representation methods, question answering, and summarization. 

Two sentences are considered semantically similar when they have a paraphrasal or entailment relation. Relatedness, on the other hand, is a much broader concept that accounts for all the commonalities between two sentences: whether they are on the same topic, express the same view, originate from the same time period, whether one elaborates on (or follows from) the other, etc. For instance, consider the following sentence pairs:

  • Pair 1: a. There was a lemon tree next to the house. b. The boy enjoyed reading under the lemon tree.

  • Pair 2: a. There was a lemon tree next to the house. b. The boy was an excellent football player.

Most people will agree that the sentences in pair 1 are more related than the sentences in pair 2. 

In this task, new textual datasets will be provided for Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu.


Data

Each instance in the training, development, and test sets is a sentence pair. The instance is labeled with a score representing the degree of semantic textual relatedness between the two sentences. The scores can range from 0 (maximally unrelated) to 1 (maximally related). These gold label scores have been determined through manual annotation. Specifically, a comparative annotation approach was used, which avoids several known limitations and biases of traditional rating scale annotation methods. This comparative annotation process led to a high reliability of the final relatedness rankings.

Further details about the task, the method of data annotation, how STR is different from semantic textual similarity, applications of semantic textual relatedness, etc. can be found in this paper: https://aclanthology.org/2023.eacl-main.55.pdf

Tracks

Each team can provide submissions for one, two or all of the tracks shown below:

Track A: Supervised

Participants are to submit systems that have been trained using the labeled training datasets provided. Participating teams are allowed to use any publicly available datasets (e.g., other relatedness and similarity datasets or datasets in any other languages). However, they must report any additional data they used, and ideally report how much each resource affected the final results.

Track B: Unsupervised

Participants are to submit systems that have been developed without the use of any labeled datasets pertaining to semantic relatedness or semantic similarity between units of text more than two words long in any language. The use of unigram or bigram relatedness datasets (from any language) is permitted. 

Track C: Cross-lingual

Participants are to submit systems that have been developed without the use of any labeled semantic similarity or semantic relatedness datasets in the target language and with the use of labeled dataset(s) from at least one other language.  Note: Using labeled data from another track is mandatory for submission to this track. 

Deciding which track a submission should go to:

  • If a submission uses labeled data in the target language: submit to Track A

  • If a submission does not use labeled data in the target language but uses labeled data from another language: submit to Track C

  • If a submission does not use labeled data in any language: submit to Track B

Note: Here, ‘labeled data’ refers to labeled datasets pertaining to semantic relatedness or semantic similarity between units of text more than two words long.
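
The decision rules above can be sketched as a small helper. This is only an illustration of the rules as stated in this announcement, not an official tool: the function name choose_track and its two boolean parameters are hypothetical, and 'labeled data' is meant in the sense defined in the note above.

```python
def choose_track(uses_target_language_labels: bool,
                 uses_other_language_labels: bool) -> str:
    """Return the track letter for a submission (hypothetical helper).

    Both flags refer to labeled relatedness/similarity data for text
    units longer than two words, as defined in the announcement.
    """
    # Any labeled data in the target language means the supervised track.
    if uses_target_language_labels:
        return "A"
    # No target-language labels, but labels from another language.
    if uses_other_language_labels:
        return "C"
    # No labeled data in any language.
    return "B"


if __name__ == "__main__":
    print(choose_track(False, True))  # cross-lingual setup
```

Checking the target-language flag first mirrors the order of the rules: a system that uses labeled data in the target language goes to Track A regardless of what else it uses.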

Evaluation

The official evaluation metric for this task is the Spearman rank correlation coefficient, which captures how well the system-predicted rankings of test instances align with human judgments. You can find the evaluation script for this shared task on our GitHub page.
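
For participants who want a quick local sanity check, here is a minimal sketch of the Spearman rank correlation in plain Python. It is not the official evaluation script (please use the one on our GitHub page for reported results); the function names average_ranks and spearman are our own, and we assume tied scores receive the average of their ranks, which is the usual convention.

```python
from __future__ import annotations

from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(gold: Sequence[float], predicted: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank lists."""
    if len(gold) != len(predicted) or len(gold) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rg, rp = average_ranks(gold), average_ranks(predicted)
    mean_g = sum(rg) / len(rg)
    mean_p = sum(rp) / len(rp)
    cov = sum((a - mean_g) * (b - mean_p) for a, b in zip(rg, rp))
    var_g = sum((a - mean_g) ** 2 for a in rg)
    var_p = sum((b - mean_p) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_g * var_p) ** 0.5


if __name__ == "__main__":
    gold = [0.0, 0.25, 0.5, 1.0]        # illustrative gold scores
    predicted = [0.1, 0.4, 0.3, 0.9]    # illustrative system scores
    print(round(spearman(gold, predicted), 3))
```

Because only the ranks matter, a system does not need to output scores on the same 0 to 1 scale as the gold labels; any scores that order the test pairs in the same way yield the same correlation.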


Important Dates

  • Training data ready: 11 September 2023

  • Evaluation starts: 20 January 2024

  • Evaluation ends: 31 January 2024

  • System description paper due: 19 February 2024

  • Notification of acceptance: 1 April 2024

  • Camera-ready due: 22 April 2024

  • SemEval workshop: 16-21 June 2024 (co-located with NAACL 2024)

NB: We will organise a Q&A mentorship session next week (mid-January 2024) and a tutorial on writing system description papers in February for all participants, especially students and junior researchers. The Zoom links will be shared by email and on Slack.

References

  • Shima Asaadi, Saif Mohammad, and Svetlana Kiritchenko. 2019. Big BiRD: A Large, Fine-Grained, Bigram Relatedness Dataset for Examining Semantic Composition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

  • M. A. K. Halliday and R. Hasan. 1976. Cohesion in English. London: Longman.

  • George A. Miller and Walter G. Charles. 1991. Contextual Correlates of Semantic Similarity. Language and Cognitive Processes, 6(1):1–28.

  • Mohamed Abdalla, Krishnapriya Vishnubhotla, and Saif Mohammad. 2023. What Makes Sentences Semantically Related? A Textual Relatedness Dataset and Empirical Study. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 782–796, Dubrovnik, Croatia. Association for Computational Linguistics.

Task Organizers

Nedjma Ousidhoum

Shamsuddeen Hassan Muhammad

Mohamed Abdalla

Krishnapriya Vishnubhotla

Vladimir Araujo

Meriem Beloucif

Idris Abdulmumin

Seid Muhie Yimam

Nirmal Surange

Christine De Kock

Sanchit Ahuja

Oumaima Hourrane

Manish Shrivastava

Alham Fikri Aji

Thamar Solorio

Saif M. Mohammad