Training set released!
SECOND CALL FOR PARTICIPATION
IberLEF 2023 Task - PoliticEs: Political ideology detection in Spanish texts
Held as part of the evaluation forum IberLEF 2023 https://sites.google.com/view/iberlef-2023 in the XXXIX edition of the International Conference of the Spanish Society for Natural Language Processing (SEPLN 2023 http://sepln2023.sepln.org/en/home/)
September 26, 2023. Jaén, Andalusia, Spain
Codalab link: https://codalab.lisn.upsaclay.fr/competitions/10173
Dear All,
We invite researchers and students to participate in the shared task PoliticEs 2023: Political ideology detection in Spanish texts, held as part of IberLEF 2023, the shared evaluation campaign for Natural Language Processing systems in Spanish and other Iberian languages, co-located with the SEPLN 2023 conference.
The goal of this task is to extract political ideology information from Spanish texts. To this end, we propose an automatic document classification task on clusters of texts. The task consists of extracting self-assigned gender and profession as demographic traits, and political ideology as a psychographic trait, from a set of texts written in Spanish by several authors who share those traits. Political ideology is treated both as a binary and as a multiclass problem. The PoliticEs 2023 shared task builds on the PoliticEs 2022 task presented at IberLEF 2022 (García-Díaz et al., 2022b), whose dataset was an extension of the PoliCorpus 2020 dataset (García-Díaz et al., 2022a). The novelty this year is that, instead of profiling individual users, participants will work with clusters of texts written by different users who share the same traits, in order to avoid legal and ethical issues.
Participants will be provided with development, development_test, training and test datasets in Spanish, drawn from an extension of PoliCorpus 2020 (García-Díaz et al., 2022a) and the corpus used for the PoliticEs 2022 shared task (García-Díaz et al., 2022b). The dataset was collected between 2020 and 2022 from the Twitter accounts of politicians, political journalists and celebrities in Spain using the UMUCorpusClassifier (García-Díaz et al., 2020). We automatically created clusters of texts by mixing some of the extracted tweets, in order to avoid the ethical and privacy issues of author profiling on Twitter. Each cluster is composed of 80 tweets written by different users who share all the traits under evaluation. We labeled each cluster with the self-assigned gender (male, female), profession (politician, celebrity, journalist) and political spectrum on two axes: binary (left, right) and multiclass (left, moderate_left, moderate_right, right). Twitter mentions of politicians, as well as mentions of any other Twitter account, were anonymised by replacing them with the @user token. Other entities, such as political party references, were replaced with the @political_party token. Consequently, the traits cannot be guessed trivially by reading a user's name and searching for information about them on the Internet. The dataset is composed of approximately 2800 different clusters.
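For readers who want a concrete picture of the data format, the mention anonymisation and clustering described above can be sketched roughly as follows. This is only an illustration under our own assumptions, not the pipeline actually used to build the corpus: the function names, the mention regular expression and the input layout (pairs of traits and text) are hypothetical, and the sketch does not enforce that the tweets in a cluster come from different users.

```python
import re
from collections import defaultdict

# Assumption: a mention is "@" followed by word characters, not preceded by
# a word character (so e-mail addresses are left untouched).
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def anonymise_mentions(text):
    """Replace every Twitter mention with the generic @user token."""
    return MENTION_RE.sub("@user", text)


def build_clusters(tweets, cluster_size=80):
    """Group tweets by shared traits and cut each group into fixed-size clusters.

    `tweets` is an iterable of (traits, text) pairs, where `traits` is a
    hashable tuple such as (gender, profession, binary, multiclass).
    Leftover tweets that do not fill a whole cluster are dropped.
    """
    by_traits = defaultdict(list)
    for traits, text in tweets:
        by_traits[traits].append(anonymise_mentions(text))

    clusters = []
    for traits, texts in by_traits.items():
        for start in range(0, len(texts) - cluster_size + 1, cluster_size):
            clusters.append((traits, texts[start:start + cluster_size]))
    return clusters
```

Each resulting cluster is a pair of a trait tuple and a list of 80 anonymised texts, which matches the shape of the labeled clusters described in this call.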
Finally, in order to facilitate participation in the competition, a notebook with two baselines will be provided. The first one will be based on a bag-of-words (BoW) model and the second one on Transformers. To download the data and the notebook, and to participate, go to https://codalab.lisn.upsaclay.fr/competitions/10173.
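To give an idea of what a BoW approach to cluster classification can look like, here is a minimal sketch that uses only the Python standard library. It is not the baseline from the notebook: the class name, the tokenizer and the choice of a multinomial Naive Bayes classifier with add-one smoothing are our own assumptions for illustration, and the two training "clusters" at the end are toy data, not taken from the dataset.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: tokens are runs of word characters, optionally prefixed by
# "@" or "#" so that @user, @political_party and hashtags stay intact.
TOKEN_RE = re.compile(r"[@#]?\w+")


def bag_of_words(cluster_texts):
    """Count lowercase tokens over all the tweets of one cluster."""
    counts = Counter()
    for text in cluster_texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


class NaiveBayesBoW:
    """Multinomial Naive Bayes with add-one smoothing over BoW counts."""

    def fit(self, clusters, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for texts, label in zip(clusters, labels):
            self.token_counts[label].update(bag_of_words(texts))
        self.vocab = {tok for c in self.token_counts.values() for tok in c}
        return self

    def predict(self, texts):
        bow = bag_of_words(texts)
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the class
            for tok, freq in bow.items():
                if tok in self.vocab:  # tokens unseen in training are ignored
                    score += freq * math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy usage with made-up texts (real clusters contain 80 tweets each).
model = NaiveBayesBoW().fit(
    [["subir impuestos sanidad pública"], ["bajar impuestos libertad empresa"]],
    ["left", "right"],
)
prediction = model.predict(["defender la sanidad pública"])
```

The same class can be trained once per trait (gender, profession, binary ideology, multiclass ideology) by passing the corresponding label list to `fit`.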
Yesterday we released the training dataset, which can be found in the "Files" subsection of the "Participate" tab. Note that this dataset includes all the instances released during the Practice stage, so there is no need to combine the two datasets.
Please remember that the CodaLab competition is open for submitting results on the development dataset provided. This dataset is available in the same section as the training dataset.
Best regards,
The PoliticEs 2023 organizing committee
References
- García-Díaz, J. A., Almela, Á., Alcaraz-Mármol, G., & Valencia-García, R. (2020). UMUCorpusClassifier: Compilation and evaluation of linguistic corpus for Natural Language Processing tasks. Procesamiento del Lenguaje Natural, 65, 139-142.
- García-Díaz, J. A., Colomo-Palacios, R., & Valencia-García, R. (2022a). Psychographic traits identification based on political ideology: An author analysis study on Spanish politicians’ tweets posted in 2020. Future Generation Computer Systems, 130(1), 59-74.
- García-Díaz, J. A., Jiménez Zafra, S. M., Martín Valdivia, M. T., García-Sánchez, F., Ureña López, L. A., & Valencia García, R. (2022b). Overview of PoliticEs 2022: Spanish Author Profiling for Political Ideology. Procesamiento del Lenguaje Natural, 69, 265-272.
Important dates
- Release of development corpora: Feb 13, 2023
- Release of training corpora: Mar 13, 2023
- Release of test corpora and start of evaluation campaign: Apr 17, 2023
- End of evaluation campaign (deadline for runs submission): May 3, 2023
- Publication of official results: May 5, 2023
- Paper submission: May 29, 2023
- Review notification: Jun 17, 2023
- Camera ready submission: Jun 27, 2023
- IberLEF Workshop (SEPLN 2023): Sep 26, 2023 (Jaén, Andalusia, Spain)
- Publication of proceedings: Sep ??, 2023
Organizing committee
- José Antonio García-Díaz (UMUTeam, Universidad de Murcia)
- Salud María Jiménez-Zafra (SINAI, Universidad de Jaén)
- María-Teresa Martín Valdivia (SINAI, Universidad de Jaén)
- Francisco García-Sánchez (UMUTeam, Universidad de Murcia)
- L. Alfonso Ureña-López (SINAI, Universidad de Jaén)
- Rafael Valencia-García (UMUTeam, Universidad de Murcia)
Salud María Jiménez Zafra sjzafra@ujaen.es
Universidad de Jaén Grupo de Investigación SINAI http://sinai.ujaen.es/ | Departamento de Informática EPS Jaén, Edificio A3, Despacho 219 Campus Las Lagunillas s/n 23071 - Jaén | +34 953212992