Dear list members (esp. those in the Japanese community),
for a cross-linguistic evaluation of co-reference annotations, I was interested in looking into the NAIST Coreference Corpus, which is based on the Kyoto Corpus. Luckily, the annotations of both corpora are available, but the primary text is not. According to the documentation of both corpora, it is necessary to first acquire the Mainichi Shimbun CD-ROM (1995). I really tried my best and followed several catalogues (incl. https://www.jaist.ac.jp/project/NLP_Portal/doc/LR/lr-cat-e.html#jp:mainichi-...), but the URL it points to (https://www.nichigai.co.jp/sales/mainichi/mainichi-data.html) isn't operational any more. Does anyone know where and how to buy that CD-ROM? Is there another way to get access to that data?
Thanks a lot, Christian
Dear all,
it seems there is no longer any real way to acquire this CD-ROM.
- As a product, the CD-ROM seems to have been superseded by a digital archive (https://www.eastview.com/resources/newspapers/mainichi-shimbun/). But this doesn't help with the corpus, because formats etc. are all different.
- The email I got from Nichigai says that overseas distribution is no longer supported (著作権者毎日新聞社が海外への頒布を認めておりません。, i.e. "The copyright holder, Mainichi Shimbun, does not permit distribution overseas.")
I found a fall-back solution of sorts: an inter-library loan of a copy held in Berlin. But this is a bit fishy, because that copy is licensed for on-screen reading only, not for compiling a database out of it. Even though I have no intention of distributing the results, compiling such a database would probably break the licensing conditions. Whether this is legally OK or not depends on whether we consider it to be "decompilation". (The data is not "code" in a computer science sense, but if it were considered as such, decompilation would be legal in Germany under certain circumstances.)
Some fun facts observed along the way:
- The days of the global internet seem to be gone. At least Yahoo! Japan isn't serving Europe anymore.
- Standoff annotation over proprietary data is much better than no annotations, but if we depend on a third party to maintain their services in a way that suits our needs (not the host's), it can never be sustainable. The CD-based distribution worked for maybe 15 years, but it does not anymore.
- We really need libraries and physical copies of digital media. (And someone maintaining the necessary hardware to read them.)
- To some extent, software decompilation is legal in Germany (https://www.gesetze-im-internet.de/urhg/__69e.html), but distribution of the results is restricted.
- Copyright and database rights have an expiration date. If we create scientific resources from proprietary sources, data providers should consider depositing their data and their sources together, with an appropriate embargo period, in a repository such as Zenodo. Let the embargo be some decades or up to a century, but at least there is a long-term prospect. People will probably not be interested in the same things then as we are now, but we create and process data they might not be able to create themselves. Living languages will have changed by then, some of their uses will be fundamentally different from what we have now, and quite a few of them will simply have died out ...
None of these observations is particularly novel, but it doesn't hurt to be reminded from time to time that we should think about future users of our data ;) Special thanks to those who responded privately!
Best regards, Christian
On Tue, 9 Jan 2024 at 12:31, Christian Chiarcos <christian.chiarcos@gmail.com> wrote: