The next meeting of the Edge Hill Corpus Research Group will take place online (via MS Teams) on Thursday 25 April 2024, 2:00-3:30 pm (UK time).
Registration closes tomorrow (Wednesday 24 April), 11 am.
Attendance is free. You can register here: https://store.edgehill.ac.uk/conferences-and-events/conferences/events/edge-...
Topics: Corpus Methodology, Large Language Models
Speakers: Sylvia Jaworska <https://www.reading.ac.uk/elal/staff/dr-sylvia-jaworska> (University of Reading, UK) & Mathew Gillings <https://www.wu.ac.at/ebc/about-us/team/mathew-gillings/> (Vienna University of Economics and Business, Austria)
Title: How humans vs. machines identify discourse topics: an exploratory triangulation
Abstract
Identifying discourses and discursive topics in a set of texts has been of interest not only to linguists but also to researchers working across the social sciences. Traditionally, these analyses have been based on small-scale interpretive analyses of discourse which involve some form of close reading. Naturally, however, close reading is only possible when the dataset is small, and it leaves the analyst open to accusations of bias and/or cherry-picking.
Designed to avoid these issues, other methods have emerged which involve larger datasets and have some form of quantitative component. Within linguistics, this has typically been through the use of corpus-assisted methods, whilst outside of linguistics, topic modelling is one of the most widely used approaches. Increasingly, researchers are also exploring the utility of LLMs (such as ChatGPT) to assist in the analysis and identification of topics. This talk reports on a study assessing the effect that analytical method has on the interpretation of texts, specifically in relation to the identification of the main topics. Using a corpus of corporate sustainability reports, totalling 98,277 words, we asked six researchers, along with ChatGPT, to interrogate the corpus and decide on its main ‘topics’ via four different methods. From one method to the next, the amount of context available gradually increases.
• Method A: ChatGPT is used to categorise the topic model output and assign topic labels;
• Method B: Two researchers were asked to view a topic model output and assign topic labels based purely on eyeballing the co-occurring words;
• Method C: Two researchers were asked to assign topic labels based on a concordance analysis of 100 randomised lines of each co-occurring word;
• Method D: Two researchers were asked to reverse-engineer a topic model output by creating topic labels based on a close reading.
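For readers unfamiliar with the sampling step in Method C, the following is a minimal sketch of drawing a random sample of concordance (keyword-in-context) lines for one word. It is an illustration only, not the speakers' tooling: the function names, the 40-character context window, and the fixed random seed are all assumptions.

```python
import random
import re


def kwic_lines(text, keyword, width=40):
    """Return one keyword-in-context line per whole-word match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


def sample_concordance(text, keyword, n=100, seed=0):
    """Randomly sample up to n concordance lines (seed is a hypothetical default)."""
    lines = kwic_lines(text, keyword)
    if len(lines) <= n:
        # Fewer hits than requested: return them all, in corpus order.
        return lines
    return random.Random(seed).sample(lines, n)


if __name__ == "__main__":
    # Toy text standing in for a corpus; not taken from the study.
    toy = "Emissions fell. We report emissions yearly. Our emissions targets are public."
    for line in sample_concordance(toy, "emissions", n=2):
        print(line)
```

Fixing the seed makes the sample reproducible, so two analysts could in principle inspect the same set of lines.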
The talk explores how the identified topics differed both between researchers in the same condition and between researchers in different conditions, shedding light on some of the mechanisms underlying topic identification by machines vs humans or machines assisted by humans. We conclude with a series of tentative observations regarding the benefits and limitations of each method, along with suggestions for researchers selecting an analytical approach for discourse topic identification. While this study is exploratory and limited in scope, it opens the way for further methodological and larger-scale triangulations of corpus-based analyses with other computational methods, including AI.
________________________________
Edge Hill University <http://ehu.ac.uk/home/emailfooter>
Modern University of the Year, The Times and Sunday Times Good University Guide 2022 <http://ehu.ac.uk/tef/emailfooter>
University of the Year, Educate North 2021/21
________________________________
This message is private and confidential. If you have received this message in error, please notify the sender and remove it from your system. Any views or opinions presented are solely those of the author and do not necessarily represent those of Edge Hill or associated companies. Edge Hill University may monitor email traffic data and also the content of email for the purposes of security and business communications during staff absence. <http://ehu.ac.uk/itspolicies/emailfooter>
Edge Hill Corpus Research Group, Thursday 25 April 2024
The presentation slides are now available: https://sites.edgehill.ac.uk/crg/files/2024/04/EHU-CRG.2024.04.25.JaworskaGi...
Topics: Corpus Methodology, Large Language Models
Speakers: Sylvia Jaworska <https://www.reading.ac.uk/elal/staff/dr-sylvia-jaworska> (University of Reading, UK) & Mathew Gillings <https://www.wu.ac.at/ebc/about-us/team/mathew-gillings/> (Vienna University of Economics and Business, Austria)
Title: How humans vs. machines identify discourse topics: an exploratory triangulation
From: Costas Gabrielatos via Corpora <corpora@list.elra.info>
Sent: Tuesday, April 23, 2024 10:52 AM
To: CORPORA LIST ELRA <corpora@list.elra.info>
Subject: [Corpora-List] Edge Hill Corpus Research Group – Meeting #12
Language Technologies and Digital Humanities: Resources and Applications (LTаDH-RA)
CLaDA-BG 2024 Conference
Sofia, Bulgaria
26-28 June 2024
The submission deadline has been extended to 08.05.2024.
https://clada-bg.eu/en/dissemination/events/international-clada-bg-conferenc...

Topics of Interest
The topics include, but are not limited to, the following:
• Problems in SS&H – research methods, technological support
• Language technologies for sentiment analysis, semantic technologies, trustworthiness of knowledge graphs, ethical challenges in digital SS&H
• Knowledge Modeling and Elicitation for digital SS&H
• Specific Language Resources and Technologies for historical texts, parliamentary records, speech and multimodal corpora, social media data
• The role of digital libraries, archives and museums in digital SS&H research
• Language Interface to Knowledge Graphs in SS&H
• Knowledge-modeled and linked applications in SS&H
• Large Language Models in DH
• Best practices and new trends in Knowledge Modeling and Linking for language, culture and history

Invited Speakers
• Darja Fišer, CLARIN ERIC. ParlaMint: From Democracy to Data and back
• Maciej Piasecki, CLARIN-PL, Wroclaw University of Science and Technology, Poland
• Pia Sommerauer, CLTL, VU University Amsterdam, the Netherlands. Know What You Are Modeling: Why We Need Interdisciplinary Perspectives to Understand Large Language Models
• Tanja Wissik, ACDH-CH, Austrian Academy of Sciences, Austria
Important Dates
Submission deadline extension: 08.05.2024
Notification of acceptance: 24.05.2024
Final Submission: 20.06.2024
Conference: 26-28.06.2024
Participation in the conference is free of charge; the fee for the conference dinner is 30 euro.