TEI has a *said* element type with a *who* attribute that can be used to encode this information.
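For the extraction side of the question, here is a minimal sketch of pulling each character's dialogue out of TEI using only the Python standard library. The sample passage and the speaker identifiers (#alice, #bob) are made up for illustration, and it assumes each *who* value is a single pointer of the form #id, which is a simplification (TEI allows several space-separated pointers).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up sample text; the who values are hypothetical identifiers.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><said who="#alice">Where are we going?</said> she asked.</p>
    <p><said who="#bob">To the <emph>river</emph>,</said> he said,
       <said who="#bob">and then home.</said></p>
  </body></text>
</TEI>"""


def dialogue_by_speaker(tei_xml):
    """Group the text of every <said> element by its who attribute."""
    root = ET.fromstring(tei_xml)
    speeches = defaultdict(list)
    for said in root.iterfind(".//tei:said", TEI_NS):
        # Assumption: who holds one pointer such as "#alice".
        who = said.get("who", "").lstrip("#") or "unknown"
        # itertext() also collects text inside nested elements like <emph>;
        # split/join normalises whitespace.
        text = " ".join("".join(said.itertext()).split())
        speeches[who].append(text)
    return dict(speeches)


result = dialogue_by_speaker(SAMPLE)
for speaker, lines in result.items():
    print(speaker, lines)
```

From there, the per-speaker lists can be fed into whatever vocabulary or stylometric analysis you prefer.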
Alternatively, you can use standoff annotation (which is particularly helpful if you are doing a lot of other annotations on the same base text).
We've done the latter at the Digital Tolkien Project, and I've used it to contrast the style of different characters (as well as how style changes over the course of a novel).
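To make the standoff idea concrete, here is a rough sketch in which the base text is left untouched and speaker attributions live in a separate list of character offsets. The (start, end, speaker) layout, the sample sentence, and the speaker names are all assumptions for illustration; this is not the format the Digital Tolkien Project actually uses.

```python
import re
from collections import Counter, defaultdict

# Base text stays unmodified; annotations point into it by character offset.
BASE_TEXT = '"Where are we going?" she asked. "To the river," he said.'

# Hypothetical standoff layer: (start, end, speaker), end exclusive.
SPEECH_ANNOTATIONS = [
    (1, 20, "alice"),
    (34, 47, "bob"),
]


def vocab_by_speaker(text, annotations):
    """Count lower-cased word forms per speaker from offset annotations."""
    counts = defaultdict(Counter)
    for start, end, speaker in annotations:
        span = text[start:end]
        counts[speaker].update(re.findall(r"[a-z']+", span.lower()))
    return dict(counts)


vocab = vocab_by_speaker(BASE_TEXT, SPEECH_ANNOTATIONS)
for speaker, counter in vocab.items():
    print(speaker, counter.most_common())
```

Because the offsets are kept apart from the text, other annotation layers (chapters, named entities, and so on) can point at the same base text without interfering with each other.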
James
On Fri, Jan 24, 2025 at 7:46 AM Darren Cook via Corpora <corpora@list.elra.info> wrote:
Is there any established xml or other markup language for novels and short stories?
I'm particularly interested in marking up dialogue with the name of the character who is speaking, and then in tools that allow extracting the dialogue of each character (e.g. to analyse and contrast the vocabulary each uses).
If so, following on from that, are there open-source ML models that try to identify the speaker to add this markup, and existing training data?
Thanks, Darren