To summarize the replies to my query for corpora of English novels, the following were suggested (in addition to ELTeC, the European Literary Text Collection, mentioned in the initial query: https://www.distant-reading.net/eltec/). Due to some difficulties in locating and accessing the corpora listed under the first two items below, the corpus creators have agreed to deposit them in the Oxford Text Archive collections; old and new links are given below:
1. Corpus of Late Modern English Texts (there is a version called CLMET 3.1 as well as the extended version) and the CEN (Corpus of English Novels) compiled by Hendrik de Smet and others:
CLMET now available here: http://hdl.handle.net/20.500.14106/2574 CEN now available here: http://hdl.handle.net/20.500.14106/2573
2. Corpus of the Canon of Western Literature https://www.researchgate.net/publication/321773386_Introducing_the_Corpus_of... (downloadable in full from https://www.researchgate.net/publication/385291433_CCWL_10_Jan_2018rar ).
CCWL now also available here: http://hdl.handle.net/20.500.14106/2575
3. AMALGUM: a very deeply annotated corpus that includes about 0.5M tokens of samples from Project Gutenberg novels, alongside data from 7 other genres:
https://github.com/gucorpling/amalgum/blob/dev/amalgum/fiction/dep/AMALGUM_f...
The data is automatically annotated with good-quality neural UD parses, coreference resolution, entity recognition, discourse parses and more, with excerpts from over 400 novels included. We also have a much smaller but manually annotated corpus which includes fiction along with other genres, in our GUM/GENTLE corpora (24 genres total): https://gucorpling.org/gum/
Many thanks to Bea Alex, Sabine Bartsch, Hendrik de Smet, Clarence Green and Amir Zeldes.
I note that this hasn't revealed a great number of available corpora! I suspect that there are a large number of ad hoc datasets that people make for specific projects, sampling from the large number of text collections available, but I still don't find many open access reference corpora for particular genres and time periods.
The CLARIN Resource Family for Literary Corpora (https://www.clarin.eu/resource-families/literary-corpora) has a number in other languages, and I'll get this updated with the above corpora and anything else that I find for English.
Best wishes, Martin Wynne
On 27/10/2024 12:11, Martin Wynne via Corpora wrote:
I have a student who is interested in tracing the development of the English novel from its origins to the present day (or at least to the start of the twentieth century), and I'm trying to gather information about relevant corpora covering this text type and period.
We know about the European Literary Text Collection (ELTeC, https://www.distant-reading.net/eltec/), which will be very useful for the later end of the timescale. We also know it is possible to assemble a corpus from Project Gutenberg, archive.org, Oxford Text Archive, etc., but would be interested in re-using any corpora that people might already have made, which aim to be representative of particular periods within this genre.
The student has some flexibility with her research question, so while the original idea of 'English novels' was probably 'novels in English from Great Britain and Ireland', other related areas such as US novels might be interesting as well.
Any tips and suggestions gratefully received. If we get a number of interesting direct emails, I'll be happy to summarize the results to the list.
Best wishes, Martin