Dear Ada,
I agree that lemmatisation is a construct and is not a universal method for linguistic analyses, but I don't understand why it is imperative that I wean myself from using lemmas.
What is it that restricts my freedom to invent the lemma (a non-universal construct) AĞAÇ-, for example, to refer to the one and only "meaningful thing" that is common to the very many (theoretically infinite, practically probably around 10,000) strings including ağaç, ağacı, ağaca, ağaçlar, ağacımızdaki, ağaçlandırılabilmesinden, ağaçsızlaşmasını, etc. etc.? How (and why) am I supposed to talk about that very large set without using a label for it?
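Concretely, here is a minimal, hypothetical sketch (the forms are just the examples above, and the index is imaginary) of all that the label buys me: one key standing in for an open-ended set of surface forms.

    # A hypothetical lemma index: the label AĞAÇ- is just a key that stands in
    # for an open-ended set of attested surface forms.
    LEMMA_INDEX = {
        "AĞAÇ-": {"ağaç", "ağacı", "ağaca", "ağaçlar",
                  "ağacımızdaki", "ağaçlandırılabilmesinden", "ağaçsızlaşmasını"},
    }

    def lemma_of(form):
        """Return the lemma (if any) under which a surface form has been filed."""
        for lemma, forms in LEMMA_INDEX.items():
            if form in forms:
                return lemma
        return None

    print(lemma_of("ağaçlandırılabilmesinden"))  # AĞAÇ-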
Best,
Orhan Bilgin
On 17 Oct 2023 18:36, Ada Wan via Corpora corpora@list.elra.info wrote:
Dear Christian,
Re your PS: one doesn't need to debate the use/future of lemmatization, though I'd welcome such as part of scholarship. For those experienced in matters in/of Linguistics, it should be clear that lemmatization was simply a construct, an entry-level philological exercise (esp. for those from Computer Science with less of a background in Linguistics and language(s)). It is sad that some have picked up the habit of using lemmatization as a heuristic (though for what, specifically?) and might have become, apparently, too addicted to it to let it go. It is imperative that one wean oneself from such a habit. Methods for linguistic morphology, e.g. (morphological) parsing or stemming, are not a universal decomposition scheme, nor a universal method for language/linguistic analyses. It is also important to bear in mind that neither linguistic morphology nor lemmas/lemmata have that long a history.
Thanks for being open-minded enough to read this far.
Best Ada
Dear Ada, dear all,
I think it's necessary to discuss this in a separate thread. As for Hugh, he had a practical problem with an existing data set and we can discuss specific solutions for that. As for Ada, whether or not lemmatization is a valid NLP task can be discussed, as well, but this has absolutely nothing to do with the very concrete request for advice on a real problem at hand.
I really don't want to dive into this, but let me focus on the first part. Of course, there are applications where lemmatization as an NLP task was assumed to be necessary but is no longer needed. But lemmas were not invented for NLP; they were invented for structuring dictionaries and describing morphology, actually several millennia before the computer (I'm thinking of Bronze Age dictionaries/word lists of cuneiform languages here, used for teaching Sumerian; but even in our 3rd-millennium-BCE Sumerian cuneiform corpus, from the time when the language was still spoken, there was a notion of lemma or head word, and scribes sometimes just wrote that because they were too lazy to write the full morphology). And the use of head words in dictionaries is a practice that won't go away as long as people are going to use dictionaries (be they digital or not) for language learning. That's equally true for writing textbook grammars and for teaching morphology (you need some kind of base form to describe your inflection patterns), as it is for rule-based morphology (which won't go away either, even though the use case is more on the low-resource side of things ... low resource meaning few corpus data, no parallel data, just plain legacy word lists and grammar sketches). And it won't go away in corpus linguistics and the philologies, at least not for use cases where people come from a dictionary perspective.
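For instance, here is a toy sketch of what I mean (simplified English endings, not any real grammar fragment): the inflection pattern is stated once, over a base form, and can then be reused across lexemes.

    # Toy sketch: an inflection pattern stated over a base form and reused.
    # The cells and endings are simplified illustrations, not a real grammar.
    PATTERN_REGULAR_NOUN = {"SG.NOM": "", "PL.NOM": "s", "SG.POSS": "'s"}

    def paradigm(base, pattern):
        """Attach each ending in the pattern to the base form."""
        return {cell: base + ending for cell, ending in pattern.items()}

    print(paradigm("tree", PATTERN_REGULAR_NOUN))
    # {'SG.NOM': 'tree', 'PL.NOM': 'trees', 'SG.POSS': "tree's"}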
Whether or not the use of lemmas (note that the question was actually not about lemmatization, but about data modelling) is a valid task depends on the use case. Working with humanists who want that because it's their established practice is a valid use case. We can debate with them, of course, but they are the experts on their use case, and I'd prefer to devote my energy to something more practically relevant, like getting them away from using MS Office for annotations or dictionaries and towards any tool that produces structured output instead. And already this can be a hard problem that might eventually kill an otherwise interesting project. (Apologies, that's not true of everyone, of course, but those cases exist, and even where people understand the necessity, we still have to work with decades of legacy data and bring it into shape.) As for the role of lemmatization in NLP, please continue to discuss without me.
@Ada, you seem to have a very concrete idea in mind of how to get humanists away from using lemmas. I guess that could be an interesting discussion at a conference on DH or language learning -- because this is where the requirement comes from.
Best, Christian
Dear Christian, dear all [pls feel free to disregard if not interested]
Discussion on a separate thread is fine.
1. Re "whether or not lemmatization is a valid NLP task": I must first clarify that I am not on this mailing list just for "NLP issues" (however "NLP" should be defined or regardless of whether it should exist as an area with "word"-based or non-general/generalizable methods beyond "machine/computational/automatic processing with (text) data"). Machine processing with data, text/"language" or not, can be done without "words" anyway. I did not just question whether lemmatization is a valid NLP task. I questioned (the necessity/validity of) morphology in general, which would affect the practice of lemmatization.
2. Re "lemmas were not invented for NLP": it depends what "lemmas" and "NLP" refer to. The study of morphology/morphs/morphemes certainly dates back quite some time (e.g. Panini --- in terms of decomposing a "word" into smaller parts). BUT the practice of naming segments as "lemmas" (and not "morphs"/"morphemes") and the use of the term "lemma" might have come from computational linguists/lexicographers and/or computing. Computing practices might have reinforced the practice of lemmatization/segmentation throughout the past decades, since back in the days (e.g. 1960s-1970s? [1]) when memory was more of an issue or when linguistic techniques were leveraged when computing with text.
3. Re "Bronze Age dictionaries/word lists of cuneiform languages": i. some of these are effects of interpretation (much of which dated back to the modern era, e.g. papyrology from the 19th century); ii. I do not argue against the possibility/practice of decomposition in general, but (linguistic) morphology is not a general decomposition approach for its being based on a notion of "stemhood" that can be arbitrary, indeterminate, "culture"/context-specific, and/or idiosyncratic (recall many hard-to-decipher symbols/graphemes on many ancient manuscripts). A more general method would be to decompose in a granularity that is fine enough and recompose based on frequency (as that's also often a pivotal criterion for empirical analyses and interpretations).
4. Re "the use of head words in dictionaries is a practice that won't go away as long as people are going to use dictionaries ... for language learning": many lexical resources (e.g. dictionaries) are based on character n-grams and do not leverage the notion of "head words". The notion of "morphology" is hence orthogonal/irrelevant. (Remark: a few decades ago, it might have been much easier in some parts of the world to get clarity on this --- just by walking into a bookstore or a library and looking at the plethora of lexical resources --- of different types/formats/designs, in general or for particular disciplines. But that practice seems to have (almost) become a lost art now. For "language learning", I'd recommend the immersion method. Nothing beats experiencing communication in multi-dimensional, full-bodied contexts. Use lexical resources only as mnemonics of sorts (don't become too pedantic/addicted on such). Use style guides (or "grammar textbooks") only when pleasing others is necessary. :) Just thought to note to those on this list who might be interested.)
5. Re "inflection patterns": please see my reply to Orhan earlier today: [tweet/x] The solution can be adapted for "(morphological) segmentation" as well. Please let me know if it is fine to you or if you have any objections.
6. Re "low resource ... just plain legacy word lists and grammar sketches": if one works in data collection: for varieties that are still alive, one should record raw and full data when possible and retire the ("colonial") practice of elicitation based on "words". One could also try to obtain parallel data in larger spans instead. For varieties that are extinct, one archives what one has. For what purposes should any "word"-based practices or linguistic morphology be involved?
7. "won't go away in corpus linguistics and the philologies": May it be for corpus linguistics, the philologies, the humanities and the social sciences --- digital or not, for "practical" purposes or not, everything (methods, approaches, interpretations, reception... etc.) can be updated.
8. Re "[w]hether or not the use of lemmas ... is a valid task depends on the use case" and "data modeling": sure, the use of tools can depend on the purpose of the task. But the issue here is: if the use of lemmas is only good for the task of lemmatization, and if the use of lemmatization with text data is only good for linguistic morphology, and when morphology is found not (or no longer) relevant/useful/correct/appropriate, what do we do with a curriculum that overfits on one representation granularity that does not have a solid foundation? What do we do with students/graduates who were fed archaic ideals?
Best Ada (Some often forget that I am also a linguist, not just a "computational person", among other roles/interests.)
[1] My dating references here are supported by: "Algorithms for stemming have been studied in computer science since the 1960s." and "The first published stemmer was written by Julie Beth Lovins in 1968." ( https://en.wikipedia.org/wiki/Stemming) I would've guessed from the 1950s otherwise....
@Ada
> What do we do with students/graduates who were fed archaic ideals?
You give them full professorships. ;-)
- Hugh
@Hugh
Re ">What do we do with students/graduates who were fed archaic ideals? You give them full professorships. ;-)":
IFF they are:
i. brave enough to say that these are indeed archaic ideals,
ii. wise enough to teach history,
iii. insightful enough to teach future (or even just current and appropriate) techniques and values (that may include but not be limited to engaging in the teaching/research of e.g. explicit data statistics in the context of ML/"NLP"/computing/"language science"),
iv. knowledgeable and experienced enough to understand how "language" and empirical matters work and relate to theoretical ones, and
v. conscientious enough to respect scientific integrity --- esp. when it comes to explaining which parts of the "word"-based narrative from the "language space" may have been a hoax, which an oversight, which foul play, and/or which an outgoing paradigm (and that, if one "had to teach 'language'", one would let students know it is something to learn and unlearn, but prepare them sufficiently so they won't be deceived by others in the future when it comes to "language matters", e.g. through doing "words"),
THEN one could give them a full professorship*. That is, only if they are good (unlike most/all current CL profs... pardon the wake-up call).
*humble/shameless brag/plug: could someone please put me onto this list? ;-)
Ada
Hi Ada,
When my niece was three years old, she said to her little brother, "Maman, elle venira plus tard…" ("Mum will come back later", in "incorrect" French).
She made a "mistake" here by using "venira" (a wrong future form of the verb venir, "to come") instead of the "correct" "viendra". It was wrong, but perfectly predictable from the most productive morphological rule of French future formation.
She was three years old, so I doubt she really understood what morphology is. Nevertheless, with this mistake she clearly showed me that her way of learning language did not consist of reading or listening to huge amounts of utterances; she was able to learn some word-formation rules from very few examples. And indeed, humans are still able to learn complex things from very little explanation and/or very few examples (something that is totally beyond ML-based language models).
In my humble opinion, this proves that morphology exists, if not in the LLM matrices, at least in the human brain. Hence modelling such rules (and even using them to analyse or to produce) is a valid approach, independently of any other (also valid) approaches.
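Just to make the generalization explicit, here is a toy sketch (3rd-person-singular only, with a deliberately tiny list of irregular stems; it is an illustration, not a real conjugator):

    # Toy sketch of the "most productive" future-formation pattern vs. memorized
    # irregular stems. 3rd-person-singular only; the stem list is illustrative.
    IRREGULAR_FUTURE_STEMS = {"venir": "viendr", "être": "ser", "aller": "ir"}

    def naive_future_3sg(infinitive):
        """Apply only the regular pattern: parler -> parlera, venir -> venira."""
        stem = infinitive[:-1] if infinitive.endswith("re") else infinitive
        return stem + "a"

    def adult_future_3sg(infinitive):
        """Use the memorized irregular stem when there is one."""
        stem = IRREGULAR_FUTURE_STEMS.get(infinitive)
        return (stem + "a") if stem else naive_future_3sg(infinitive)

    print(naive_future_3sg("venir"))  # 'venira' -- the child's over-regularization
    print(adult_future_3sg("venir"))  # 'viendra' -- the conventional form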
To put it another way:
There have been many scientific "proofs" that humans would never be able to fly… And those proofs were valid under their own hypotheses.
Indeed, planes do not flap their wings… they use other means to perform a task that was performed by birds.
Nevertheless, I have never witnessed a plane (or a pilot) trying to convince birds that their way of flying is obsolete (or that it stems from a colonialist point of view of Aves on the task at hand…) and asking them to renounce this oh-so-obsolete bad habit.
Regards,
Gilles,
I have also been carefully reading the exchanges. Although I was planning not to add to this exchange, at this point I am tempted to reply.
Ada's early emails were adding something to the discussion and debate, but at this point they are simply saying 'I am right, you are wrong', without giving any explanation or evidence.
I was also thinking of the same kind of examples as those given by Gilles. Until Ada provides some very good reasoning and evidence, it is hard for me to completely agree with her, although, as I said earlier, I do agree with her on many, perhaps most, things.
Ada, I sincerely respect your learning and competence. However, you said earlier that you are proposing an alternative computational phenomenology. That would be really interesting. Wouldn't it be better to first propose it and argue, in more specific terms and with more convincing arguments and evidence, that it is the right one, or at least 'more right' than the existing ones (there is more than one)? Given that there is already Information Theory, it has to go beyond the byte, which is an accidental unit of computation, and the character, which is also not well-defined, sometimes even for one specific writing system.
To give one such example, perhaps not the best one: I always thought of an Indic-script dependent vowel (maatraa) as a character, but I recently found that languages like Java and Python do not treat such written symbols as characters, so when I try to get the length of an Indic-script string, the built-in string length functions give only the number of consonant symbols and independent vowels in the string. We got wrong results using these functions, and I only accidentally discovered that this was the case. The reason, of course, is that these functions and programming languages treat such dependent vowels as diacritics, which is also correct in some ways. I did not realize this earlier because in India we often use a Latin-script-based notation called WX for Indic scripts in NLP, due to the encoding and input method related problems that I referred to in one of my earlier replies. The WX notation, however, does not distinguish between dependent and independent vowels and treats both of them as the same character, which is how most of us in India, if not all, think of them, to the best of my knowledge. On the other hand, the consonant modifier 'halant' is not used in WX, but is used in Indic scripts, and its presence might also cause disagreements about what the string length is. In other words, character as a unit does not work, in your terms. In fact, who knows how many errors for Indic-script text have made their way into computational results due to this simple fact. And perhaps they still do, because it took me a long time to realize this, which at first led to consternation, because in text processing, if you can't rely on the string length function, what can you rely on?
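To make the counting issue concrete, here is a minimal sketch (it only contrasts two counting conventions for one Devanagari word; it is not meant to reproduce the exact behaviour of any particular language or library):

    import unicodedata

    WORD = "किताब"  # Hindi 'kitaab' (book): 3 consonants + 2 dependent vowel signs

    # Convention 1: count Unicode code points (what Python's len() reports).
    code_points = len(WORD)

    # Convention 2: count only non-combining code points, i.e. skip dependent
    # vowel signs (category Mc), the halant/virama and other marks (Mn).
    base_chars = sum(
        1 for ch in WORD if unicodedata.category(ch) not in ("Mn", "Mc")
    )

    print(code_points)  # 5
    print(base_chars)   # 3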
As for phonemes, major ML researchers like Vincent Ng don't believe it to be a real unit of language. The argument is that we don't need phonemes for applications like speech recognition.
If not the byte and the character, what are we left with in terms of computational phenomenology? At the very least there has to be such a well-argued and well-evidenced alternative in order to persuade others to agree with your views. I would be very much interested in thinking about such an alternative, even if at present I don't think you are right about all your views. After all, to throw away millennia of work on language science, very strong reasoning and evidence for an alternative is not an unrealistic expectation.