Corpora of English novels

Martin Wynne

27 Oct 2024

1:11 p.m.

I have a student who is interested in tracing the development of the English novel from its origins to the present day (or at least to the start of the twentieth century), and I'm trying to gather information about relevant corpora covering this text type and period.

We know about the European Literary Text Collection (ELTeC, https://www.distant-reading.net/eltec/), which will be very useful for the later end of the timescale. We also know it is possible to assemble a corpus from Project Gutenberg, archive.org, the Oxford Text Archive, etc., but we would be interested in re-using any corpora that people might already have made and that aim to be representative of particular periods within this genre.

The student has some flexibility with her research question, so while the original idea of 'English novels' was probably 'novels in English from Great Britain and Ireland', other related areas such as US novels might be interesting as well.

Any tips and suggestions gratefully received. If we get a number of interesting direct emails, I'll be happy to summarize the results to the list.

Best wishes, Martin

--
Senior Researcher in Corpus Linguistics
Faculty of Linguistics, Philology and Phonetics, University of Oxford
National Co-ordinator, CLARIN-UK
martin.wynne@ling-phil.ox.ac.uk
https://orcid.org/0000-0002-4155-0530

Amir.Zeldes@georgetown.edu

28 Oct

9:16 p.m.

Hi Martin,

I'm not sure if this will help, but if your student is interested in doing something with richly annotated Gutenberg data, there is a very deeply annotated corpus including about 0.5M tokens of samples from Project Gutenberg novels here, next to data from 7 other genres:

https://github.com/gucorpling/amalgum/blob/dev/amalgum/fiction/dep/AMALGUM_f...

The data is automatically annotated with good-quality neural UD parses, coreference resolution, entity recognition, discourse parses and more, and includes excerpts from over 400 novels. We also have a much smaller but manually annotated corpus which includes fiction, along with other genres, in our GUM/GENTLE corpora (24 genres total):

https://gucorpling.org/gum/

Hope these are useful,
Amir

------------
Dr. Amir Zeldes
Assoc. Prof. of Computational Linguistics
Department of Linguistics
Georgetown University
1437 37th St. NW
Washington, DC 20057

https://gucorpling.org/amir

_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-leave@list.elra.info

Martin Wynne

29 Oct

3:58 p.m.

Hi Amir,

Many thanks for getting in touch, and for letting me know about this. I think the student in this case wants full texts, but this data looks very interesting, particularly given the rich annotation. It will definitely be useful for a number of use cases, and I'll add it to my summary for the list.

Best wishes, Martin

Michal Cukr

4 Nov

2:25 p.m.

Dear Martin,

The following might be useful.

The Sketch Engine database provides access to two corpora relevant to your student's interest:

- Gutenberg English Corpus 2020 (https://www.sketchengine.eu/gutenberg-corpora-2020/): a corpus of almost 3 billion words containing all English books available through the Gutenberg platform as of April 2020. Note that the corpus metadata includes only the author’s birth and death years, not the year of publication.
- English Historical Book Collection (EEBO, ECCO, Evans; https://www.sketchengine.eu/historical-collection-eebo-ecco-evans/): a collection of over 800 million words from English books published in the UK and US between 1473 and 1820.

With best regards,

Michal Cukr
Sketch Engine team

Mark Davies

3:34 p.m.

Martin,

Sorry that I'm responding to this so late.

The 475-million-word Corpus of Historical American English (COHA, https://www.english-corpora.org/coha/) has about 220 million words of fiction from the 1820s-2010s (see the number of texts and words by decade below). Nearly all of the texts from the 1820s-1980s are novels, whereas there are more short stories from the 1990s-2010s. Full information (for all texts) can be found here: https://www.english-corpora.org/coha/files/sources-coha-2020.zip

Full-text (downloadable) data can be found at CorpusData.org (https://www.corpusdata.org/) as well as at the University of Stuttgart (https://www.ims.uni-stuttgart.de/en/research/resources/corpora/ccoha/).

Best,

Mark Davies
English-Corpora.org

------------------------------------------

decade   #texts        #words
1820s        90     3,778,554
1830s       179     7,492,464
1840s       243     8,615,569
1850s       151     9,175,764
1860s       249     9,279,356
1870s       217    10,454,445
1880s       264    11,204,077
1890s       257    11,261,720
1900s       266    12,096,794
1910s       296    12,266,683
1920s       281    12,668,146
1930s       533    11,959,731
1940s       420    12,030,426
1950s       470    12,014,411
1960s       403    11,652,761
1970s       335    11,652,921
1980s       334    11,664,130
1990s      1711    13,337,688
2000s      4224    14,624,639
2010s      3672    15,150,555
TOTAL     14595   222,380,834
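A minimal sketch of how the per-decade figures above can be tallied for a sub-period, for example the fiction published before 1900 that is closest to the original query. It assumes the table has been transcribed by hand into a Python dictionary; the names `COHA_FICTION` and `totals` are illustrative and are not part of any COHA tooling.

```python
# Per-decade fiction counts transcribed from the table above: (texts, words).
COHA_FICTION = {
    "1820s": (90, 3_778_554),
    "1830s": (179, 7_492_464),
    "1840s": (243, 8_615_569),
    "1850s": (151, 9_175_764),
    "1860s": (249, 9_279_356),
    "1870s": (217, 10_454_445),
    "1880s": (264, 11_204_077),
    "1890s": (257, 11_261_720),
    "1900s": (266, 12_096_794),
    "1910s": (296, 12_266_683),
    "1920s": (281, 12_668_146),
    "1930s": (533, 11_959_731),
    "1940s": (420, 12_030_426),
    "1950s": (470, 12_014_411),
    "1960s": (403, 11_652_761),
    "1970s": (335, 11_652_921),
    "1980s": (334, 11_664_130),
    "1990s": (1711, 13_337_688),
    "2000s": (4224, 14_624_639),
    "2010s": (3672, 15_150_555),
}


def totals(first_decade, last_decade):
    """Return (texts, words) summed over the decades in the inclusive range."""
    start, end = int(first_decade[:4]), int(last_decade[:4])
    selected = [
        counts
        for decade, counts in COHA_FICTION.items()
        if start <= int(decade[:4]) <= end
    ]
    texts = sum(t for t, _ in selected)
    words = sum(w for _, w in selected)
    return texts, words


if __name__ == "__main__":
    # Whole table: should reproduce the TOTAL row.
    print(totals("1820s", "2010s"))
    # Nineteenth-century portion only (1820s-1890s).
    print(totals("1820s", "1890s"))
```

Summing the full range reproduces the TOTAL row (14,595 texts, 222,380,834 words), which is a quick check that the transcription is correct; the 1820s-1890s slice comes to 1,650 texts and 71,261,949 words.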

--
============================================
Mark Davies
english-corpora.org
mark-davies.org
============================================

Martin Wynne

2 Dec

1:38 p.m.

To summarize the replies to my query about corpora of English novels, the following were suggested, in addition to the European Literary Text Collection (ELTeC, https://www.distant-reading.net/eltec/) mentioned in the initial query. Because of some difficulties in locating and accessing the corpora listed under the first two items, the corpus creators have agreed to deposit them in the Oxford Text Archive collections; old and new links are given below:

1. Corpus of Late Modern English Texts (there is a version called CLMET 3.1 as well as the extended version) and the CEN (Corpus of English Novels) compiled by Hendrik de Smet and others:

CLMET now available here: http://hdl.handle.net/20.500.14106/2574
CEN now available here: http://hdl.handle.net/20.500.14106/2573

2. Corpus of the Canon of Western Literature (CCWL), https://www.researchgate.net/publication/321773386_Introducing_the_Corpus_of... (downloadable in full from https://www.researchgate.net/publication/385291433_CCWL_10_Jan_2018rar).

CCWL now also available here: http://hdl.handle.net/20.500.14106/2575

3. AMALGUM: a richly annotated corpus including about 0.5M tokens of samples from Project Gutenberg novels, alongside data from 7 other genres:

https://github.com/gucorpling/amalgum/blob/dev/amalgum/fiction/dep/AMALGUM_f...

Many thanks to Bea Alex, Sabine Bartsch, Hendrik de Smet, Clarence Green and Amir Zeldes.

I note that this hasn't revealed a great number of available corpora! I suspect that there are a large number of ad hoc datasets that people make for specific projects, sampling from the many text collections available, but I still don't find many open access reference corpora for particular genres and time periods.

The CLARIN Resource Family for Literary Corpora (https://www.clarin.eu/resource-families/literary-corpora) has a number in other languages, and I'll get this updated with the above corpora and anything else that I find for English.

Best wishes, Martin Wynne
