[Corpora-List] Re: Lemmas and Lemmatization [was Re: NIF: NLP Interchange Format]

25 Oct 2023

@Ada
...
What do we do with students/graduates who were fed archaic ideals?
You give them full professorships. ;-)
- Hugh
On Wed, Oct 25, 2023 at 7:29 AM Ada Wan via Corpora corpora@list.elra.info
wrote:
...
Dear Christian, dear all [pls feel free to disregard if not interested]
Discussion on a separate thread is fine.

Re "whether or not lemmatization is a valid NLP task":

I must first clarify that I am not on this mailing list just for "NLP
issues" (however "NLP" should be defined or regardless of whether it should
exist as an area with "word"-based or non-general/generalizable methods
beyond "machine/computational/automatic processing with (text) data").
Machine processing with data, text/"language" or not, can be done without
"words" anyway.
I did not just question whether lemmatization is a valid NLP task. I
questioned (the necessity/validity of) morphology in general, which would
affect the practice of lemmatization.

Re "lemmas were not invented for NLP":

it depends what "lemmas" and "NLP" refer to. The study of
morphology/morphs/morphemes certainly dates back quite some time (e.g.
Panini --- in terms of decomposing a "word" into smaller parts). BUT the
practice of naming segments as "lemmas" (and not "morphs"/"morphemes") and
the use of the term "lemma" might have come from computational
linguists/lexicographers and/or computing. Computing practices might have
reinforced the practice of lemmatization/segmentation throughout the past
decades, ever since the days (e.g. 1960s-1970s? [1]) when memory was
more of an issue or when linguistic techniques were leveraged for
computing with text.

Re "Bronze Age dictionaries/word lists of cuneiform languages":

i. some of these are effects of interpretation (much of which dates back
to the modern era, e.g. papyrology from the 19th century);
ii. I do not argue against the possibility/practice of decomposition in
general, but (linguistic) morphology is not a general decomposition
approach for its being based on a notion of "stemhood" that can be
arbitrary, indeterminate, "culture"/context-specific, and/or idiosyncratic
(recall many hard-to-decipher symbols/graphemes on many ancient
manuscripts). A more general method would be to decompose in a granularity
that is fine enough and recompose based on frequency (as that's also often
a pivotal criterion for empirical analyses and interpretations).
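
[Illustration of item ii: one possible reading of "decompose in a fine granularity and recompose based on frequency" is a greedy pair-merging loop in the style of byte-pair encoding. This is only a sketch under that assumption, not necessarily the method meant above; the function name merge_most_frequent and the toy input are hypothetical.]

```python
from collections import Counter


def merge_most_frequent(sequences, num_merges):
    """Split each string into characters, then repeatedly merge the most
    frequent adjacent pair of segments (a BPE-style sketch)."""
    seqs = [list(s) for s in sequences]  # finest granularity: characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))  # count adjacent segment pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing recurs, so stop recomposing
            break
        merges.append(a + b)
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)  # recompose the frequent pair
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs, merges


segments, learned = merge_most_frequent(["abab", "abc"], num_merges=2)
```

[No notion of "stem" or "word" enters the loop: the only criterion is how often two adjacent segments co-occur, and joining the segments of each sequence always gives back the original string.]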

Re "the use of head words in dictionaries is a practice that won't go
away as long as people are going to use dictionaries ... for language
learning":

many lexical resources (e.g. dictionaries) are based on character n-grams
and do not leverage the notion of "head words". The notion of "morphology"
is hence orthogonal/irrelevant.
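
[Illustration: a lookup keyed on character n-grams rather than head words might be sketched as below. The entries, the choice n=3, and the helper names char_ngrams, build_index and lookup are assumptions made for this example; they are not taken from any particular dictionary or library.]

```python
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(entries, n=3):
    """Map every character n-gram to the set of entries containing it."""
    index = defaultdict(set)
    for entry in entries:
        for gram in char_ngrams(entry, n):
            index[gram].add(entry)
    return index


def lookup(index, query, n=3):
    """Rank entries by how many of the query's n-grams they share."""
    hits = Counter()
    for gram in char_ngrams(query, n):
        for entry in index.get(gram, ()):
            hits[entry] += 1
    # Most shared n-grams first; ties broken alphabetically.
    return sorted(hits, key=lambda e: (-hits[e], e))


index = build_index(["walking", "walked", "talks"])
print(lookup(index, "walk"))
```

[The index never designates a base form: "walk" retrieves related strings purely through shared substrings.]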
(Remark:
a few decades ago, it might have been much easier in some parts of the
world to get clarity on this --- just by walking into a bookstore or a
library and looking at the plethora of lexical resources --- of different
types/formats/designs, in general or for particular disciplines. But that
practice seems to have (almost) become a lost art now.
For "language learning", I'd recommend the immersion method. Nothing beats
experiencing communication in multi-dimensional, full-bodied contexts. Use
lexical resources only as mnemonics of sorts (don't become too
pedantic about or addicted to them). Use style guides (or "grammar textbooks") only
when pleasing others is necessary. :) Just thought to note to those on this
list who might be interested.)

Re "inflection patterns": please see my reply to Orhan earlier today:

[tweet/x]
The solution can be adapted for "(morphological) segmentation" as well.
Please let me know if it is fine with you or if you have any objections.

Re "low resource ... just plain legacy word lists and grammar sketches":

if one works in data collection: for varieties that are still alive, one
should record raw and full data when possible and retire the ("colonial")
practice of elicitation based on "words". One could also try to obtain
parallel data in larger spans instead. For varieties that are extinct, one
archives what one has.
For what purposes should any "word"-based practices or linguistic
morphology be involved?

Re "won't go away in corpus linguistics and the philologies":

Be it for corpus linguistics, the philologies, the humanities or the
social sciences --- digital or not, for "practical" purposes or not,
everything (methods, approaches, interpretations, reception... etc.) can be
updated.

Re "[w]hether or not the use of lemmas ... is a valid task depends on
the use case" and "data modeling":

sure, the use of tools can depend on the purpose of the task. But the
issue here is: if the use of lemmas is only good for the task of
lemmatization, and if the use of lemmatization with text data is only good
for linguistic morphology, and when morphology is found not (or no longer)
relevant/useful/correct/appropriate, what do we do with a curriculum that
overfits on one representation granularity that does not have a solid
foundation? What do we do with students/graduates who were fed archaic
ideals?
Best
Ada
(Some often forget that I am also a linguist, not just a "computational
person", among other roles/interests.)
[1] My dating references here are supported by: "Algorithms for stemming
have been studied in computer science since the 1960s." and "The first
published stemmer was written by Julie Beth Lovins in 1968." (
https://en.wikipedia.org/wiki/Stemming)
I would've guessed from the 1950s otherwise....
On Wed, Oct 18, 2023 at 9:05 AM Christian Chiarcos via Corpora <
corpora@list.elra.info> wrote:
...
Dear Ada, dear all,
I think it's necessary to discuss this in a separate thread. As for Hugh,
he had a practical problem with an existing data set and we can discuss
specific solutions for that. As for Ada, whether or not lemmatization is a
valid NLP task can be discussed, as well, but this has absolutely nothing
to do with the very concrete request for advice on a real problem at hand.
I really don't want to dive into this, but focus on the first part. Of
course, there are applications where lemmatization as an NLP task was
assumed to be necessary but is no longer needed. But lemmas were not
invented for NLP, they were invented for structuring dictionaries and
describing morphology, actually several millennia before the computer (I'm
thinking of Bronze Age dictionaries/word lists of cuneiform languages here,
used for teaching Sumerian, but even in our 3rd-millennium BCE Sumerian
cuneiform corpus, from the time when it was still spoken, there was a notion
of lemma or head word, and scribes sometimes just wrote that because they
were too lazy to write out the full morphology). And the use of head words in
dictionaries is a practice that won't go away as long as people are going
to use dictionaries (be they digital or not) for language learning. And
that's equally true for writing textbook grammars and for teaching
morphology (you need some kind of base form to describe your inflection
patterns), as it is for rule-based morphology (that won't go away, either,
even though the use case is more on the low resource side of things ... low
resource meaning few corpus data, no parallel data, just plain legacy word
lists and grammar sketches). And also, it won't go away in corpus
linguistics and the philologies, at least not for use cases where people
come from a dictionary perspective.
Whether or not the use of lemmas (note that the question was actually not
about lemmatization, but about data modelling) is a valid task depends on
the use case. Working with humanists that want that because it's their
established practice is a valid use case. We can debate with them, of
course, but they are the experts on their use case, and I'd prefer to
devote my energy to something more practically relevant, like getting them
away from using MS Office for annotations or dictionaries and to use any
tool that produces structured output, instead. And already this can be a
hard problem that might eventually kill an otherwise interesting project.
(Apologies, that's not true of everyone, of course, but those cases exist,
and even where people understand the necessity, we still have to work with
decades of legacy data to bring into shape.) As for the role of
lemmatization in NLP, please continue to discuss without me.
@Ada, you seem to have a very concrete idea in mind of how to get humanists
away from lemmas. I guess that could be an interesting discussion
at a conference on DH or language learning -- because this is where the
requirement comes from.
Best,
Christian
On Tue, Oct 17, 2023 at 7:45 PM Bilgin, Orhan (Postgraduate
Researcher) via Corpora corpora@list.elra.info wrote:
...
Dear Ada,
I agree that lemmatisation is a construct and is not a universal method
for linguistic analyses, but I don't understand why it is imperative that I
wean myself from using lemmas.
What is it that restricts my freedom to invent the lemma (a
non-universal construct) AĞAÇ-, for example, to refer to the one and only
"meaningful thing" that is common to the very many (theoretically infinite,
practically probably around 10,000) strings including ağaç, ağacı, ağaca,
ağaçlar, ağacımızdaki, ağaçlandırılabilmesinden, ağaçsızlaşmasını, etc.
etc.? How (and why) am I supposed to talk about that very large set without
using a label for it?
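
[Illustration: the label-for-a-set idea can be sketched as a mapping from a label to the surface stems it covers. The forms are the ones listed above; the stem variants "ağaç"/"ağac" reflect the ç/c alternation visible in those forms. The prefix test and the names STEM_VARIANTS and label_for are assumptions for this example only. A real analyser would need far more than prefix matching, which can also over-match unrelated strings.]

```python
import unicodedata

# Hypothetical table: one label standing for a set of surface stems.
STEM_VARIANTS = {"AĞAÇ-": ("ağaç", "ağac")}

FORMS = [
    "ağaç", "ağacı", "ağaca", "ağaçlar", "ağacımızdaki",
    "ağaçlandırılabilmesinden", "ağaçsızlaşmasını",
]


def label_for(form, stem_variants=STEM_VARIANTS):
    """Return the label whose stem variants prefix the form, else None."""
    form = unicodedata.normalize("NFC", form)  # compare composed characters
    for label, stems in stem_variants.items():
        stems = tuple(unicodedata.normalize("NFC", s) for s in stems)
        if form.startswith(stems):  # str.startswith accepts a tuple
            return label
    return None


for form in FORMS:
    print(form, "->", label_for(form))
```

[The label is just a name for the set: every listed form maps to AĞAÇ-, while a string outside the set maps to nothing.]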
Best,
Orhan Bilgin
On 17 Oct 2023 18:36, Ada Wan via Corpora corpora@list.elra.info
wrote:
Dear Christian
Re your PS:
one doesn't need to debate the use/future of lemmatization, though I'd
welcome such as part of scholarship. For those experienced in matters in/of
Linguistics, it should be clear that lemmatization was simply a construct,
an entry-level philological exercise (esp. for those from Computer Science
with less of a background in Linguistics and language(s)). It has been sad
that some have picked up the habit of using lemmatization as a heuristic
(though for what, specifically?) and might have become, apparently, too
addicted to it to let it go. It is imperative that one wean oneself
from such a habit.
Methods for linguistic morphology, e.g. (morphological) parsing or
stemming, are not a universal decomposition scheme, nor a universal method
for language/linguistic analyses. It is also important to bear in mind that
neither linguistic morphology nor lemmas/lemmata have that long a
history.
Thanks for being open-minded enough to read this far.
Best
Ada

Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-leave@list.elra.info

