Dear colleagues,
We are pleased to announce the release of PxCorpus, a corpus of 4 hours of transcribed and annotated dialogues of drug prescriptions in French, acquired through an experiment with 55 participants, both experts and non-experts in drug prescription. The corpus was built through a collaboration between the Laboratoire d'Informatique de Grenoble (LIG), the University Hospital of Grenoble (CHU Grenoble), and the company Calystene, as part of a CIFRE project financed by the ANRT (Association Nationale de la Recherche et de la Technologie).
To the best of our knowledge, PxCorpus is the first corpus of spoken medical drug prescriptions to be distributed. The automatic transcriptions were manually verified and aligned with semantic labels to allow the training of NLP models. The data acquisition protocol was reviewed by medical experts and permits free distribution without breaching privacy or regulations.
## Overview of the Corpus
The experiment was performed in the wild with naive participants and medical experts. In total, the dataset includes 2067 recordings of 55 participants (38% non-experts, 25% doctors, 36% medical practitioners), manually transcribed and semantically annotated.

| Category         | Sessions | Recordings | Time (min) |
|------------------|----------|------------|------------|
| Medical experts | 258 | 434 | 94.83 |
| Doctors | 230 | 570 | 105.21 |
| Non experts | 415 | 977 | 62.13 |
| Total | 903 | 1981 | 262.27 |
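
As a quick way to read the table above, the following minimal sketch derives two per-category figures: recordings per session and mean duration per recording. The numbers are copied by hand from the table; the names `STATS` and `summarize` are illustrative assumptions and are not part of the corpus distribution.

```python
# Per-category figures copied from the table above:
# (sessions, recordings, total time in minutes).
STATS = {
    "Medical experts": (258, 434, 94.83),
    "Doctors": (230, 570, 105.21),
    "Non experts": (415, 977, 62.13),
}


def summarize(stats):
    """Return derived per-category metrics (hypothetical helper, for illustration)."""
    summary = {}
    for category, (sessions, recordings, minutes) in stats.items():
        summary[category] = {
            # Average number of recordings made in one session.
            "recordings_per_session": recordings / sessions,
            # Average length of one recording, converted to seconds.
            "seconds_per_recording": minutes * 60 / recordings,
        }
    return summary


if __name__ == "__main__":
    for category, metrics in summarize(STATS).items():
        print(
            f"{category}: "
            f"{metrics['recordings_per_session']:.2f} recordings/session, "
            f"{metrics['seconds_per_recording']:.1f} s/recording"
        )
```

Derived this way, the figures suggest that recordings from non-experts are on average noticeably shorter than those from doctors and medical experts, which may be worth keeping in mind when splitting the data for training and evaluation.
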
## License
We hope that the community will benefit from the dataset, which is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
## How to cite this corpus
If you use the corpus or need more details, please refer to the following paper: A Spoken Drug Prescription Dataset in French for Spoken Language Understanding
@InProceedings{Kocabiyikoglu2022,
  author = "Alican Kocabiyikoglu and Fran{\c c}ois Portet and Prudence Gibert and Hervé Blanchon and Jean-Marc Babouchkine and Gaëtan Gavazzi",
  title = "A Spoken Drug Prescription Dataset in French for Spoken Language Understanding",
  booktitle = "13th Language Resources and Evaluation Conference (LREC 2022)",
  year = "2022",
  location = "Marseille, France"
}
A more complete description of the corpus acquisition is available on arXiv:
@misc{kocabiyikoglu2023spoken,
title={Spoken Dialogue System for Medical Prescription Acquisition on Smartphone: Development, Corpus and Evaluation},
author={Ali Can Kocabiyikoglu and François Portet and Jean-Marc Babouchkine and Prudence Gibert and Hervé Blanchon and Gaëtan Gavazzi},
year={2023},
eprint={2311.03510},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
## Download
The corpus can be found on Zenodo under the following reference and link:
*PxCorpus : A Spoken Drug Prescription Dataset in French for Spoken Language Understanding and Dialogue*
https://zenodo.org/doi/10.5281/zenodo.6482586