Hi Darren,

In the GUM corpus (https://gucorpling.org/gum/), which includes fiction chapters and short stories, we’ve also used who/whom annotations with the TEI tag <sp> for speaker, like this:

...
<sp who="#Pag" whom="#Siri"> <s type="decl"> " Oh <hi rend="italic"> man </hi> are we in trouble . " </s> </sp>
</p>
<sp who="#Siri" whom="#Pag">
<p> <s type="decl">
" They started it . "
</s>
...
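
To illustrate how markup like this supports pulling out each character's dialogue, here is a minimal Python sketch using only the standard library. It assumes a simplified, well-formed fragment modeled on the excerpt above (the enclosing <text> element, the SAMPLE string and the function name speech_by_speaker are made up for illustration; the actual GUM files contain more markup and may need different handling):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified fragment modeled on the excerpt above.
# The <text> wrapper is an assumption added to make it well-formed.
SAMPLE = """<text>
<sp who="#Pag" whom="#Siri"><s type="decl">" Oh <hi rend="italic">man</hi> are we in trouble . "</s></sp>
<sp who="#Siri" whom="#Pag"><p><s type="decl">" They started it . "</s></p></sp>
</text>"""


def speech_by_speaker(xml_text):
    """Group the text of each <sp> element by its who attribute."""
    root = ET.fromstring(xml_text)
    result = defaultdict(list)
    for sp in root.iter("sp"):
        # who="#Siri" -> "Siri"
        speaker = sp.get("who", "").lstrip("#")
        # Collect all nested text and normalize whitespace.
        text = " ".join("".join(sp.itertext()).split())
        result[speaker].append(text)
    return dict(result)


if __name__ == "__main__":
    for speaker, lines in speech_by_speaker(SAMPLE).items():
        print(speaker, lines)
```

The same loop could read the whom attribute as well if addressee information is needed.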

The data is also available with dependency parses in the CoNLL-U format, where speaker and addressee comments reflect the same information:

# speaker = Siri
# addressee = Pag
# text = "They started it."
1 " " PUNCT `` _ 3 punct 3:punct Discourse=evaluation-comment:152->151:0:_|SpaceAfter=No
2 They they PRON PRP Case=Nom|Number=Plur|Person=3|PronType=Prs 3 nsubj 3:nsubj Entity=(35-person-giv:inact-cf1-1-ana)
3 started start VERB VBD Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin 0 root 0:root MSeg=start-ed
4 it it PRON PRP Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs 3 obj 3:obj Entity=(34-event-giv:inact-cf2-1-coref)|SpaceAfter=No
5 . . PUNCT . _ 3 punct 3:punct SpaceAfter=No
6 " " PUNCT '' _ 3 punct 3:punct _
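
As a rough sketch of how the speaker and addressee comments could be read back out of such a file, the following Python snippet parses the "# key = value" comment lines of each sentence block without any third-party library. The SAMPLE string is the sentence above with the feature and MISC columns abbreviated to "_", and the function name sentences_by_speaker is hypothetical:

```python
from collections import defaultdict

# The sentence from above, tab-separated as in real CoNLL-U files;
# FEATS and MISC columns are abbreviated to "_" for brevity.
SAMPLE = (
    "# speaker = Siri\n"
    "# addressee = Pag\n"
    '# text = "They started it."\n'
    '1\t"\t"\tPUNCT\t``\t_\t3\tpunct\t3:punct\t_\n'
    "2\tThey\tthey\tPRON\tPRP\t_\t3\tnsubj\t3:nsubj\t_\n"
    "3\tstarted\tstart\tVERB\tVBD\t_\t0\troot\t0:root\t_\n"
    "4\tit\tit\tPRON\tPRP\t_\t3\tobj\t3:obj\t_\n"
    "5\t.\t.\tPUNCT\t.\t_\t3\tpunct\t3:punct\t_\n"
    "6\t\"\t\"\tPUNCT\t''\t_\t3\tpunct\t3:punct\t_\n"
)


def sentences_by_speaker(conllu_text):
    """Map each speaker to a list of (addressee, sentence text) pairs."""
    by_speaker = defaultdict(list)
    # Sentences in CoNLL-U are separated by blank lines.
    for block in conllu_text.strip().split("\n\n"):
        meta = {}
        for line in block.splitlines():
            if line.startswith("#") and "=" in line:
                key, value = line[1:].split("=", 1)
                meta[key.strip()] = value.strip()
        # Sentences without a speaker comment (e.g. narration) are skipped.
        if "speaker" in meta and "text" in meta:
            by_speaker[meta["speaker"]].append(
                (meta.get("addressee"), meta["text"])
            )
    return dict(by_speaker)


if __name__ == "__main__":
    print(sentences_by_speaker(SAMPLE))
```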

For content that can be attributed to a source without an actual speech event taking place (e.g. “According to Bob Dylan, behind every beautiful thing, there's some kind of pain”), we also have explicit attribution relation annotations in the framework of Enhanced Rhetorical Structure Theory (eRST, https://gucorpling.org/erst/), and there are similar annotations for attributions in the framework of the Penn Discourse Treebank as well.

Hope that’s helpful!
Amir
------------
Dr. Amir Zeldes
Assoc. Prof. of Computational Linguistics
Department of Linguistics
Georgetown University
1437 37th St. NW
Washington, DC 20057

From: James Tauber via Corpora <corpora@list.elra.info>
Sent: Saturday, January 25, 2025 12:29 AM
To: Darren Cook <darren@dcook.org>
Cc: corpora@list.elra.info
Subject: [Corpora-List] Re: Story markup languages

TEI has a <said> element with a who attribute that can be used to encode this information.

Alternatively you can use standoff annotation (which is particularly helpful if you are doing a lot of other annotations on the same base text).

We've done the latter at the Digital Tolkien Project and I've used it to contrast the style of different characters (as well as how it changes throughout a novel).

James

On Fri, Jan 24, 2025 at 7:46 AM Darren Cook via Corpora <corpora@list.elra.info> wrote:
Is there any established xml or other markup language for novels and short stories?
I'm particularly interested in marking up dialogue with the name of the character who is speaking, and then in tools that allow extracting the dialogue of each character (e.g. to analyse and contrast the vocabulary each uses).
If so, following on from that, are there open-source ML models that try to identify the speaker to add this markup, and existing training data?
Thanks, Darren
_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-leave@list.elra.info