Release of the PxCorpus : A Spoken Drug Prescription Dataset in French for Spoken Language Understanding and Dialogue - Corpora

8 Nov 2023


Dear colleagues,
We are pleased to announce the release of the PxCorpus, 4 hours of 
transcribed and annotated dialogues of drug prescriptions in French, 
acquired through an experiment with 55 participants, both experts and 
non-experts in drug prescription. This corpus was built in 
collaboration between the Laboratoire d'Informatique de Grenoble (LIG), 
the University Hospital of Grenoble (CHU Grenoble) and the company 
Calystene, through a CIFRE project financed by the ANRT (Association 
Nationale de la Recherche et de la Technologie).
PxCorpus is, to the best of our knowledge, the first corpus of spoken 
medical drug prescriptions to be distributed. The automatic 
transcriptions were verified by human annotators and aligned with 
semantic labels to allow training of NLP models. The data acquisition 
protocol was reviewed by medical experts and permits free distribution 
without breach of privacy or regulation.
## Overview of the Corpus
The experiment was performed in real-world ("in the wild") conditions 
with naive participants and medical experts.
In total, the dataset includes 1981 recordings of 55 participants (38% 
non-experts, 25% doctors, 36% medical practitioners), manually 
transcribed and semantically annotated.
| Category        | Sessions | Recordings | Time (min) |
|-----------------|----------|------------|------------|
| Medical experts |   258    |    434     |   94.83    |
| Doctors         |   230    |    570     |  105.21    |
| Non-experts     |   415    |    977     |   62.13    |
| Total           |   903    |   1981     |  262.27    |
## License
We hope that the community will be able to benefit from the dataset, 
which is distributed under a Creative Commons Attribution 4.0 
International (CC BY 4.0) licence.
## How to cite this corpus
If you use the corpus or need more details, please refer to the 
following paper: A Spoken Drug Prescription Dataset in French for 
Spoken Language Understanding
@InProceedings{Kocabiyikoglu2022,
   author =     "Alican Kocabiyikoglu and Fran{\c c}ois Portet and Prudence Gibert and Hervé Blanchon and Jean-Marc Babouchkine and Gaëtan Gavazzi",
   title =     "A Spoken Drug Prescription Dataset in French for Spoken Language Understanding",
   booktitle =     "13th Language Resources and Evaluation Conference (LREC 2022)",
   year =     "2022",
   location =     "Marseille, France"
}
A more complete description of the corpus acquisition is available on arXiv:
@misc{kocabiyikoglu2023spoken,
title={Spoken Dialogue System for Medical Prescription Acquisition 
on Smartphone: Development, Corpus and Evaluation},
author={Ali Can Kocabiyikoglu and François Portet and Jean-Marc 
Babouchkine and Prudence Gibert and Hervé Blanchon and Gaëtan Gavazzi},
year={2023},
eprint={2311.03510},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
## Download
The corpus can be found in the Zenodo catalogue under the following 
title and link:
*PxCorpus : A Spoken Drug Prescription Dataset in French for Spoken 
Language Understanding and Dialogue*
https://zenodo.org/doi/10.5281/zenodo.6482586
-- 

François PORTET
Professor - Univ. Grenoble Alpes
Laboratoire d'Informatique de Grenoble - GETALP team
Bâtiment IMAG - Office 333
700 avenue Centrale
Domaine Universitaire - 38401 St Martin d'Hères
FRANCE

Phone: +33 (0)4 57 42 15 44
Email: francois.portet@imag.fr
Web: http://membres-liglab.imag.fr/portet/