[resending since it looked like my reply bounced due to size of thread]
---------- Forwarded message --------- From: Ada Wan adawan919@gmail.com Date: Mon, Aug 7, 2023 at 3:46 PM Subject: Re: [Corpora-List] Re: Any literature about tensors-based corpora NLP research with actual examples (and homework ;-)) you would suggest? ... To: Anil Singh anil.phdcl@gmail.com Cc: Hesham Haroon heshamharoon19@gmail.com, corpora <corpora@list.elra.info>
Hi Anil
Just a couple of quick points for now:
1. Re encoding etc.: I know how involved and messy these "encoding" and "clean-up" issues are/were and could be for the non-ASCII world. I wish people who work on multilingual NLP and/or language technologies would give these issues more attention.
[Iiuc, these intros cover the standardized bit from Unicode (also good for other viewers at home, I mean, readers of this thread :) ): https://www.unicode.org/notes/tn10/indic-overview.pdf https://www.unicode.org/faq/indic.html]
Re "there is still no universally accepted way to actually use these input methods" and "Most Indians, for example, on social media and even in emails type in some pseudo-phonetic way using Roman letters with the QWERTY keyboard": actually this is true in many other parts of the non-ASCII world. Even in DE and FR, one has "gotten used to" circumventing the ASCII-conventions with alternate spellings, e.g. spelling out umlauts, dropping the accents.... Between i) getting people to just use script the "standardized way" and ii) weaning people from wordhood --- I wonder which would be harder. :) But I regard i) as a vivid showcase of human creativity, I don't think there is anything to "correct". As with ii), it could be interpreted as baggage accompanying "ASCII-centrism". Language is (re-)productive in nature anyway, there will always be novel elements or ways of expression that could be ahead of standardization. I don't think we should dictate how people write or express themselves digitally (by asking them to adhere to standards or grammar), but, instead, we (as technologists) should use good methods and technologies to deal with the data that we have. (N.B. one doesn't write in "words". One just writes. "Word count" is arbitrary, but character count is not --- unless one is in charge of the lower-level/technical design of such.)
For text-based analyses with "non-standard varieties or input idiosyncrasies", I suppose one would just have to have good methods in order to extract information from these (if so wished).
(Btw, iiuc, you are also of the opinion that there isn't much/anything else to work on as far as character encoding for scripts usually used in India or standardization on this front is concerned, correct? If not, please let it be known here. I bet/hope some Unicode folks might be on this thread?)
2. Re word and morphology: I suppose what I am advocating is that not only is there really no word (whatever that means etc.), there is also no morphology --- neither as something intrinsic to language nor as a science. I understand that this can be hard to accept for those of us who have a background in Linguistics or in many language-related areas, because morphology has been part of an academic discipline for a while. But the point is that morphological analyses are really just some combination of post-hoc analyses based on traditional philological practices and preferences (i.e. ones with selection bias). Morphology is not a universal decomposition method.
3. Re "but ... Sometimes it is wise to be silent": I'm afraid I'd have to disagree with you here. Before my experiences these past years, I had never wanted to be "famous"/"infamous". I had just wanted to work on my research. But in the course of these few difficult years, I realized part of the "complexity" lied in something that might have had made my whole life and my "(un-)career"/"atypical career path" difficult --- our language interpretation and our community's/communities' attitudes towards language (again, whatever "language" means!).
4. Re grammar: much of my "'no grammar' campaign" comes from my observation that many people seem to have lost it with language. The "word"/"grammar"-hacking phenomena, esp. in Linguistics/CL/NLP, exacerbate, on the one hand, our dependency on grammar and, on the other, that on words --- when neither of these is real or necessary. And the more I see these people advertising their work on social media, the more likely they are to misinform and influence others (be these the general public or computer scientists who do not have much of a background in language or in the science of language), warping their relationship with language. (The results of my resolving "morphological complexity" in "Fairness in Representation" could have been hard to experience for some.)
Re "I don't say it is necessary, but I see it as one possible model to describe language": Ok, to this "grammar is useful" argument: esp. considering how grammar and linguistics (as a subject) both suffered from selection bias, and both have some intrinsic problems in their approach with "judgments", even if such models might work sometimes in describing some data, I don't think it'd be ethical to continue pursuing/developing grammatical methods. When dealing with texts, character n-grams (based on actual data!!)** and their probabilities should suffice as a good description --- insofar as one actually provides a description. There is no need to call it "grammar". It just invokes tons of unnecessary dependencies. **And when over time and across datasets, one is able to generalize truths (may these be for language/communication/information/computation/mathematics/...) from these statistical accounts that hold, then there is a potential for good theories. But as of now, even in Linguistics, there is still a tendency to cherry-pick, in e.g. discarding/disregarding whitespaces. And esp. considering how little data (as in, so few good/verified datasets) we have in general, there is still quite some work to do for all when it comes to good data scientific theories.
Re "which can be useful for some -- like educational -- purposes if used in the right way": how grammar has been too heavily used in education is more of an ailment than a remedy or healthy direction. I think the only way for grammar to survive is to regard it as "style guides".
Re "particularly with non-native speakers of English ... sometimes your patience is severely tested": to that: yes, I understand/empathize, but no --- in that we ought to change how writing/language is perceived. Depending on the venue/publication, I think sometimes we ought to relax a bit with other's stylistic performance. The "non-native speakers of X" has been a plague in Linguistics for a while now. We have almost paved our way to some next-gen eugenics with that. Most of us on this mailing list are not submitting to non-scientific literary publications, I'd prefer better science and content to better writing styles at any time.
Re "magic": I think sometimes I do get some "magical" results with higher-dimensional models. So there is some "magic" (or ((math/truths)+data)? :) ) that is not so obvious. That, I am always glad to look further into.
Best,
Ada
On Sat, Aug 5, 2023 at 6:26 PM Anil Singh anil.phdcl@gmail.com wrote:
On Sat, Aug 5, 2023 at 6:56 PM Ada Wan adawan919@gmail.com wrote:
Hi Anil
Thanks for your comments. (And thanks for reading my work.)
Yeah, there is a lot that one has to pay attention to when it comes to what "textual computing" entails (and to which extent it "exists"). Beyond "grammar" definitely. But experienced CL folks should know that. (Is this you btw: https://scholar.google.com/citations?user=QKnpUbgAAAAJ?
Yes, that's me, for the better or for the worse.
If not, do you have a webpage for your work? Nice to e-meet you either way!)
Thank you.
Re "I know first hand the problems in doing NLP for low resource languages
which are related to text encodings": which specific languages/varieties are you referring to here? If the issue lies in the script not having been encoded, one can contact SEI about it (https://linguistics.berkeley.edu/sei/)? I'm always interested in knowing what hasn't been encoded. Are the scripts on this list ( https://linguistics.berkeley.edu/sei/scripts-not-encoded.html)?
Well, that's a long story. It is related to the history of the adoption of computers by the public at large in India. The really difficult part is not whether scripts have been encoded. Instead, it is about a script being over-encoded or encoded in a non-standard way, and about the lack of adoption of standard encodings and input methods. Just to give one example: even though a single official encoding (called ISCII) for all Brahmi-origin scripts of India was created, most people were unaware of it or didn't use it, for many reasons. One major reason was that it was not supported on operating systems, including Windows (which was anyway developed many years after the creation of ISCII). Input methods and rendering engines for it were not available. You had to have a special terminal to use it, but that was a text-only terminal, used mainly in research centers and perhaps for some very limited official purposes. And computers, as far as the general public was concerned, were most commonly used for DeskTop Publishing (which entered Indian languages as "DTP"). The non-standard encodings were mainly font encodings, meant just to enable proper rendering of text for publishing. One of the most popular 'encodings' was based on the Remington typewriter for Hindi. Another was based on a mostly phonetic mapping from Roman to Devanagari. Languages which did not use Devanagari also had their own non-standard encodings, often multiple ones. The reason these became popular was that they enabled people to type in Indian languages and see the text rendered properly, since no other option was available, and they understandably didn't really care about it being standard or not. It wasn't until recently that Indic scripts were properly supported by any OS's. It is possible that even now, when Unicode is supported on most OS's and input methods are available as part of the OS, there are people still using non-standard encodings. Even now, you can come across problems related to either input methods or rendering for Indic scripts. And most importantly, there is still no universally accepted way to actually use these input methods. Most Indians, for example, on social media and even in emails type in some pseudo-phonetic way using Roman letters on the QWERTY keyboard. Typing in Indian languages using Indic scripts is still a specialized skill.
The result of all this is that when you try to collect data for low resource languages, including major languages of India, there may be a lot of data -- or perhaps even all the data, depending on the language -- which is in some non-standard ad-hoc encoding with a non-trivial mapping to Unicode. This is difficult partly because non-standard encodings are often based on glyphs, rather than actual units of the script. So, to be able to use such data you need a perfect encoding converter to get the text into Unicode (UTF-8). Such converters have existed for a long time, but since they were difficult to create, they were/are proprietary and not available even to researchers in most cases. It seems a pretty good OCR system has been developed for Indic scripts/languages, but I have not yet had the chance to try it.
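Just to sketch the kind of thing such a converter has to do -- the mapping below is entirely made up (a real table for a Remington- or phonetic-style font runs to hundreds of glyph entries, and real reordering is far hairier), so take it only as a toy illustration of the two essential steps: greedy longest-match over glyph codes, then reordering of pre-base glyphs:

```
# Toy sketch of a font-encoding -> Unicode converter. GLYPH_MAP is
# hypothetical; only the algorithm is the point here.
I_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I: stored BEFORE the
                    # consonant in glyph order, AFTER it in Unicode.

GLYPH_MAP = {       # hypothetical glyph code -> Unicode string
    "d": "\u0915",               # say this font renders 'd' as KA
    "f": I_MATRA,                # the pre-base short-i matra glyph
    "ds": "\u0915\u094d\u0937",  # one ligature glyph = a whole cluster
}
MAX_GLYPH = max(map(len, GLYPH_MAP))

def to_unicode(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        # greedy longest match, since ligature glyphs span several codes
        for n in range(min(MAX_GLYPH, len(s) - i), 0, -1):
            if s[i:i+n] in GLYPH_MAP:
                out.append(GLYPH_MAP[s[i:i+n]])
                i += n
                break
        else:
            out.append(s[i])  # pass through anything unmapped
            i += 1
    chars = list("".join(out))
    # Move each pre-base i-matra after the following consonant. (A real
    # converter must move it past the whole consonant cluster, handle
    # half forms, reph, etc. -- this is the simplest possible version.)
    j = 0
    while j < len(chars) - 1:
        if chars[j] == I_MATRA:
            chars[j], chars[j + 1] = chars[j + 1], chars[j]
            j += 2
        else:
            j += 1
    return "".join(chars)

print(to_unicode("fd"))  # glyph order "i-matra, KA" -> "\u0915\u093f"
```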
For example, I am currently (for the last few years) working on Bhojpuri, Magahi and Maithili. When we tried to collect data for these languages, we ran into the same problem -- which is actually not really a problem for the general public, since their purposes are served by these non-standard encodings, but for NLP/CL it makes getting the data in a usable form difficult.
This is just a brief overview and I also don't really know the full extent of it, in the sense that I don't have a comprehensive list of such non-standard encodings for all Indic scripts.
Re the unpublished paper (on a computational typology of writing systems?): when and to where (as in, which venues/publications) did you submit it? I remember one of my first term papers from the 90s being on the phonological system of written Cantonese (or sth like that --- don't remember my wild days); the prof told me it wasn't "exactly linguistics"...
I had submitted it to the journal Written Language and Literacy in 2009. It was actually mostly my mistake that I didn't submit a revised version of the paper, as I was going through a difficult period then.
Re "on building an encoding converter that will work for all 'encodings' used for Indian languages": this sounds interesting!
Yes, I still sometimes wish I could build it.
Re "I too wish there was a good comprehensive history text encodings, including non-standard ad-hoc encodings": what do you mean by that --- history of text encodings or historical text encodings? After my discoveries from recent years, when my "mental model" towards what's been practiced in the language space (esp. in CL/NLP) finally *completely *shifted, I had wanted to host (or co-host) a tutorial on character encoding for those who might be under-informed on the matter (including but not limited to the "grammaroholics" (esp. the CL/NLP practitioners who seem to be stuck doing grammar, even in the context of computing) --- there are so many of them! :) )
I mostly meant the non-standard 'encodings' (really just ad-hoc mappings) created to serve someone's immediate purpose. To fully understand the situation, you have to be familiar with the social, political, economic, etc. aspects of the language situation in India.
Re "word level language identification": I don't do "words" anymore. In that 2016 TBLID paper of mine, I (regrettably) was still going with the flow in under-reporting on tokenization procedures (like what many "cool" ML papers did). But "words" do certainly shape the results! I'm really forward to everyone working with full-vocabulary, pure character or byte formats (depending on the task), while being 100% aware of statistics. Things can be much more transparent and easily replicable/reproducible that way anyway.
Well, I used the word 'word' just as a shorthand for space-separated segments. In my PhD thesis, I had also argued against the word being the unit of computational processing, or whatever you call it. I had called the unit an Extra-Lexical Unit, consisting of a core morpheme and inflectional parts. I realize now that even that may not necessarily work for languages with highly fusional morphology. But something like this is now the preferred unit of morphological processing, as in the CoNLL shared tasks and UniMorph. I also realize that I could not have been the first to come to this conclusion.
Re "We have to be tolerant of what you call bad research for various unavoidable reasons. Research is not what it used to be": No, I think one should just call out bad research and stop doing it. I wouldn't want students to burn their midnight oil working hard for nothing. Bad research warps also expectations and standards, in other sectors as well (education, healthcare, commerce... etc.). Science, as in the pursuit of truth and clarity, is and should be the number 1 priority of any decent research. (In my opinion, market research or research for marketing purposes should be all consolidated into one track/venue if they lack scientific quality.) I agree research is not what it used to be --- but in the sense that the quality is much worse in general, much hacking around with minor, incremental improvements. Like in the case of "textual computing", people are "grammar"-hacking.
I completely agree with you, but ... Sometimes it is wise to be silent.
Re "better ... gender representation": hhmm... I'm not so sure about that.
You are a better judge of that. I just shared my opinion, which may not be completely free from bias, although I do try.
Re "About grammar, I have come to think of it as a kind of language model for describing some linguistic phenomenon": nah, grammar not necessary.
I don't say it is necessary, but I see it as one possible model to describe language, which can be useful for some -- like educational -- purposes if used in the right way.
Re grammaroholic reviewers: yeah, there are tons of those in the CL/NLP space. I think many of them are only willing and/or able to critique grammar. The explicit bit is that it shows they don't want to check one's math and code --- besides, when most work with "words" anyway, there is a limit to how replicable/reproducible things are, esp. on a different dataset. The implicit bit, however, is that I think there is some latent intent to introduce/reinforce the influence of "grammar" in the computing space. That, I do not agree with at all.
I should confess that I am sometimes guilty of that (pointing out grammatical mistakes) myself. However, the situation is complicated in countries like India due to historical and other reasons. I think papers should at least be in a condition in which they can be understood roughly as intended. This may not always be the case, particularly with non-native speakers of English, or people who are not yet speakers/writers of English at all. Now, perhaps no one knows better than I do that it is not really completely their fault, but as a reviewer, sometimes your patience is severely tested.
Re "magic": yes, once one gets over the hype, it's just work.
True, but what I said is based on where I am coming from (as in, to this position), which will take a really long time to explain. Of course, I don't literally mean magic.
Re "I have no experience of field work at all and that I regret, but it is
partly because I am not a social creature": one can be doing implicit and unofficial "fieldwork" everyday if one pays attention to how language is used.
That indeed I do all the time. I meant official fieldwork.