Dear list members,
We are excited to announce that an AIGC corpus, the aiTECCL Corpus, is now available to all members of the research community. The aiTECCL Corpus was compiled by Jiajin Xu and Mingchen Sun of the Corpus Research Group of the National Research Centre for Foreign Language Education at Beijing Foreign Studies University.
The corpus consists of two million words generated by the GPT-3.5 model, using the same writing prompts as those employed in *the TECCL Corpus* (http://corpus.bfsu.edu.cn/info/1070/1449.htm). It aims to serve as a reference corpus of native-like linguistic quality. The corpus was made available online on 9 August 2023.
URL: http://114.251.154.212/cqp/
Username: test
Password: test
Please cite: Xu, Jiajin & Mingchen Sun. 2023. aiTECCL: An AIGC English Essay Corpus. Beijing: National Research Centre for Foreign Language Education, Beijing Foreign Studies University. Available online: http://corpus.bfsu.edu.cn/info/1082/1913.htm
*Justifying the concept of "AIGC Corpus" (Artificial Intelligence Generated Content Corpus) or Generative Corpus*
The creation of the AIGC Corpus helps expand the concept of "corpus". In the classic definition of a corpus, the included materials must be authentic, naturally occurring language samples from real-life communication. Clearly, generative texts do not fall under this category. We believe that the rationale for the generative corpus can be viewed from at least three aspects:
1. The so-called principle of "authenticity" is itself a matter of degree. For example, it is questionable whether essays written by learners under exam conditions count as genuine communication. In existing research, some elicited data also has authenticity issues similar to those found in learners' interlanguage. Existing corpora therefore already contain texts with varying degrees of authenticity.
2. The generative corpus can serve as an essential complement to existing corpora. The emergence of the generative corpus can reconcile the distinction between "probable language" and "possible language." Linguistic instances that have not yet appeared in reality can be generated using large language models.
3. Creating a corpus using artificial intelligence technology is a second-best solution under the current conditions for building specific types of corpora. For example, the aiTECCL corpus simulates a reference corpus of approximately 10,000 essays that are close in quality to the writing of native English speakers and written on the same topics as those assigned to Chinese learners. Without the use of artificial intelligence methods for generation, it might be impossible to obtain a reference corpus of such quality and comparability. Similarly, corpora of languages spoken in least-developed countries or in countries with extremely small populations would be impossible to build in the short term without generative technology.
Further details about the prompt and the Python script we utilised to create the corpus will be provided soon on the site: http://corpus.bfsu.edu.cn/info/1082/1913.htm
Best wishes,
Jiajin Xu
Ph.D., Professor
National Research Centre for Foreign Language Education
Beijing Foreign Studies University, China