The next meeting of the Edge Hill Corpus Research Group will take place online (via MS Teams) on Thursday 25 April 2024, 2:00-3:30 pm (UK time).
Attendance is free. You can register here: https://store.edgehill.ac.uk/conferences-and-events/conferences/events/edge-...
Topics: Corpus Methodology, Large Language Models
Speakers: Sylvia Jaworska <https://www.reading.ac.uk/elal/staff/dr-sylvia-jaworska> (University of Reading, UK) & Mathew Gillings <https://www.wu.ac.at/ebc/about-us/team/mathew-gillings/> (Vienna University of Economics and Business, Austria)
Title: How humans vs. machines identify discourse topics: an exploratory triangulation
Abstract
Identifying discourses and discursive topics in a set of texts has been of interest not only to linguists but also to researchers working across the social sciences. Traditionally, such work has relied on small-scale interpretive analyses of discourse which involve some form of close reading. Naturally, however, close reading is only possible when the dataset is small, and it leaves the analyst open to accusations of bias and/or cherry-picking.
Designed to avoid these issues, other methods have emerged which involve larger datasets and have some form of quantitative component. Within linguistics, this has typically been through the use of corpus-assisted methods, whilst outside of linguistics, topic modelling is one of the most widely used approaches. Increasingly, researchers are also exploring the utility of LLMs (such as ChatGPT) to assist in the analysis and identification of topics. This talk reports on a study assessing the effect that analytical method has on the interpretation of texts, specifically in relation to the identification of the main topics. Using a corpus of corporate sustainability reports, totalling 98,277 words, we asked six researchers, along with ChatGPT, to interrogate the corpus and decide on its main ‘topics’ via four different methods. The amount of context available increases from one method to the next.
• Method A: ChatGPT is used to categorise the topic model output and assign topic labels;
• Method B: Two researchers were asked to view a topic model output and assign topic labels based purely on eyeballing the co-occurring words;
• Method C: Two researchers were asked to assign topic labels based on a concordance analysis of 100 randomised lines of each co-occurring word;
• Method D: Two researchers were asked to reverse-engineer a topic model output by creating topic labels based on a close reading.
The talk explores how the identified topics differed both between researchers in the same condition and between researchers in different conditions, shedding light on some of the mechanisms underlying topic identification by machines vs. humans, or by machines assisted by humans. We conclude with a series of tentative observations regarding the benefits and limitations of each method, along with suggestions for researchers selecting an analytical approach for discourse topic identification. While this study is exploratory and limited in scope, it opens the way for further methodological and larger-scale triangulations of corpus-based analyses with other computational methods, including AI.