Dear all
I'm looking for a guide or advice on crowd-sourcing linguistic annotations via platforms such as Mechanical Turk. I'm thinking of rating tasks such as evaluating positive and negative sentiment in sentences, or annotating concordances from a corpus for a certain property (e.g. deontic vs. epistemic meaning in modal verbs).
Specifically, I'm wondering:
- How can I ensure that the annotations are of sufficient quality? I don't have a gold standard for all the data; after all, this is why I need the annotations. If I get all the data annotated by two or three independent annotators, I can ensure adequate quality. But then I might still get annotators who more or less submit random annotations (or start doing so after a while), or at least it would take me a very long time to find out who is doing so.
- How do I find out what remuneration is adequate?
- What is a good way to split up the data for annotation? Single annotation units or, say, 50 or 100 at a time? How do I deliver them effectively to the annotators?
Many thanks and best wishes Robert
Hi Robert,
One of my students recently published a paper which looked into some of these issues - in particular, how to ensure annotator quality, how to evaluate it, how different kinds of annotation error (random vs. consistent) might affect the result, and how to figure out what level of inter-annotator agreement (IAA) is good enough for a task. We had a good experience with Amazon Mechanical Turk, which is also discussed in the paper. For example, you can set a preliminary test that workers have to pass first, and there are ways to incentivise them to actually do a good job.
https://aclanthology.org/2022.lrec-1.128/
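As a rough illustration of the IAA side (a toy sketch in Python, not code from the paper; the label lists are invented), a chance-corrected measure such as Cohen's kappa over a doubly-annotated batch is far more informative than raw percent agreement:

# Minimal sketch: agreement between two annotators on the same batch.
# The labels below are invented; in practice they come from two crowd
# workers who annotated identical items.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "neu", "pos", "neg", "pos"]
annotator_b = ["pos", "neg", "pos", "pos", "neg", "neu"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect, 0 = chance-level agreement

For more than two annotators, Krippendorff's alpha (e.g. via the krippendorff package or NLTK's AnnotationTask) is the usual generalisation.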
Diana
I find sets of 10 translations to be large enough to start becoming boring. I have an interest in deontic expressions. What languages are you targeting?
- Hugh Paterson III
Hi, One great resource is Chris Callison-Burch's class on Crowdsourcing and Human Computation:
http://crowdsourcing-class.org/
Sam Bowman's group has also published several informative papers in this space:
https://aclanthology.org/2021.findings-emnlp.421.pdf https://aclanthology.org/2021.naacl-main.385.pdf
Good luck with the project!
Jonathan Kummerfeld
-- Senior Lecturer (i.e. research tenure-track Asst. Prof.), University of Sydney
e: j.k.kummerfeld@gmail.com w: www.jkk.name
Hi Robert,
+1 to CCB's class on Crowdsourcing and Human Computation. I've consulted with 100+ annotation projects around the world, in addition to running my own projects over the last decade. Broadly speaking, we've seen every team move away from crowdsourcing the work. While it is often the lowest-cost method to start with, the quality issues add up and frequently require multiple iterations or an outright redo. A lot of companies ultimately decide to proceed with a managed service, though there are exceptions depending on the complexity and volume of the work. If you're interested, I'd be happy to make introductions to vetted labeling partners.
Full disclosure: I run a for-profit labeling software application, but we support academic and research projects entirely free of charge, and we make no commercial profit from vendor introductions.
Hi Robert,
Some other references that might be helpful:
(1) Omar Alonso (2019), "The Practice of Crowdsourcing": https://www.morganclaypool.com/doi/10.2200/S00904ED1V01Y201903ICR066
(2) Silviu Paun et al. (2022), "Statistical Methods for Annotation Analysis": https://www.morganclaypool.com/doi/abs/10.2200/S01131ED1V01Y202109HLT054?jou...
(3) Essentially any work by Matt Lease: https://www.ischool.utexas.edu/~ml/
Best, Udo
---- Prof. Dr. Udo Kruschwitz Lehrstuhl für Informationswissenschaft Universität Regensburg | Universitätsstraße 31 | D-93053 Regensburg Udo.Kruschwitz@ur.de
Hi Robert,
I am a big proponent of comparative annotations (like paired comparison or best-worst scaling) rather than the more commonly used rating scales. Comparative questions such as 'which item is more positive?' usually work much better than asking something like 'is this neutral or slightly positive or moderately positive or ...?'. Here is some work and scripts that may be of interest: http://saifmohammad.com/WebPages/BestWorst.html
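As a toy sketch (not the scripts from the page above; the items and trials are invented), the simple counting procedure for turning best-worst judgements into scores looks roughly like this:

# Each trial shows an annotator a small tuple of items; they pick the most
# positive ("best") and the least positive ("worst"). An item's score is the
# proportion of trials where it was best minus the proportion where it was worst.
from collections import Counter

trials = [  # (items shown, chosen best, chosen worst) -- invented examples
    (("A", "B", "C", "D"), "A", "D"),
    (("A", "B", "C", "E"), "B", "E"),
    (("B", "C", "D", "E"), "B", "D"),
]

best, worst, shown = Counter(), Counter(), Counter()
for items, b, w in trials:
    shown.update(items)
    best[b] += 1
    worst[w] += 1

scores = {item: (best[item] - worst[item]) / shown[item] for item in shown}  # in [-1, 1]
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(item, round(score, 2))

A modest number of such tuples already gives a fine-grained, real-valued ranking of the items.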
Another favorite is to intersperse a small percentage of hidden "gold" questions, i.e. items that have been pre-annotated by your team. These are usually simple items to annotate (not boundary cases). If an annotator gets a large percentage of these questions wrong, then perhaps they are not the best annotator. However, always double-check whether the expert gold annotations are missing something. This is also discussed in the papers at the link above.
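And a minimal sketch of the gold-question bookkeeping (the worker IDs, item IDs, labels and the 0.8 threshold are all made up for illustration):

# Flag annotators whose accuracy on hidden gold items falls below a threshold.
from collections import defaultdict

gold = {"item_17": "deontic", "item_42": "epistemic"}  # pre-annotated check items

responses = [  # (worker_id, item_id, label) as returned by the platform
    ("w1", "item_17", "deontic"), ("w1", "item_42", "epistemic"),
    ("w2", "item_17", "epistemic"), ("w2", "item_42", "epistemic"),
]

correct, seen = defaultdict(int), defaultdict(int)
for worker, item, label in responses:
    if item in gold:
        seen[worker] += 1
        correct[worker] += int(label == gold[item])

THRESHOLD = 0.8  # arbitrary cut-off; tune it per task
for worker, n in seen.items():
    acc = correct[worker] / n
    note = "  <- review their work before accepting/rejecting" if acc < THRESHOLD else ""
    print(f"{worker}: {acc:.0%} correct on gold items{note}")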
Finally, this just-up-on-ArXiv paper might be of interest as well: Best Practices in the Creation and Use of Emotion Lexicons https://arxiv.org/abs/2210.07206
Cheers. -Saif
Hi Robert,
I’m not sure if this goes beyond the scope of the original question, but if you have students and an opportunity to introduce annotation tasks into the curriculum as part of coursework, then I recommend considering ‘class-sourcing’, i.e. letting students annotate and also participate in developing the guidelines.
Students working in the field are already experts compared to crowd workers who are unfamiliar with the underlying theories, and in my experience student response to such projects has been very positive. They often report learning a lot from it and finding it more practical and rewarding than assignments which do not have a practical outcome. I wrote a paper about doing this here:
https://link.springer.com/article/10.1007/s10579-016-9343-x
Best,
Amir