*FinCausal 2023: Financial Document Causality Detection*
We are glad to announce that the Training Datasets for both English and
Spanish have been released and are available on CodaLab at this link:
https://codalab.lisn.upsaclay.fr/competitions/14596
Please register on CodaLab and go to the FinCausal 2023 competition.
Under Participate, you will find the Training Datasets together with a
Starting Kit to guide you through the Task.
###### *Task Description and Important Links* ######
The *FinCausal-2023 Shared Task: “Financial Document Causality Detection”* is
organised within the *5th Financial Narrative Processing Workshop (FNP
2023)*, taking place at the 2023 IEEE International Conference on Big Data
(IEEE BigData 2023) <http://bigdataieee.org/BigData2023/>, Sorrento, Italy,
15-18 December 2023. The workshop is a *one-day event*.
Workshop URL: https://wp.lancs.ac.uk/cfie/fincausal2023/
###### *Additional Information* ######
*Shared Task Description:*
Financial analysis needs factual data and an explanation of the variability
of these data. Data state facts, but more knowledge is needed regarding how
these facts materialised. Furthermore, understanding causality is crucial in
studying decision-making processes.
The *Financial Document Causality Detection Task* (FinCausal) aims at
identifying elements of cause and effect in causal sentences extracted from
financial documents. Its goal is to evaluate which events or chains of
events can cause a financial object to be modified or an event to occur,
within a given context. In the financial landscape, identifying cause
and effect from external documents and sources is crucial for explaining
why a transformation occurs.
Two subtasks are organised this year: the *English FinCausal subtask* and
the *Spanish FinCausal subtask*. This is the first year in which we
introduce a subtask in Spanish.
*Objective*: For both tasks, participants are asked to identify, given a
causal sentence, which elements of the sentence relate to the cause, and
which relate to the effect. Participants can use any method they see fit
(regex, corpus linguistics, entity relationship models, deep learning
methods) to identify the causes and effects.
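To make the objective concrete, here is a purely illustrative, rule-based sketch in Python that splits a segment on a causal connective. The connective list, the example segment, and the assumption that the effect precedes the cause are simplifications for illustration only, not part of the task definition or an official baseline:

```python
import re

# Toy connective-based splitter (illustrative only; not an official baseline).
# It assumes the effect precedes the connective and the cause follows it,
# which is only one of several orderings found in real segments.
CONNECTIVES = re.compile(
    r"\b(because of|due to|owing to|as a result of|driven by)\b", re.IGNORECASE
)

def split_cause_effect(segment):
    """Return a (cause, effect) pair, or None if no connective is found."""
    match = CONNECTIVES.search(segment)
    if match is None:
        return None
    effect = segment[: match.start()].strip(" ,.")
    cause = segment[match.end():].strip(" ,.")
    return cause, effect

# Hypothetical segment, not taken from the dataset:
print(split_cause_effect("Net profit fell 12% due to higher raw material costs."))
# -> ('higher raw material costs', 'Net profit fell 12%')
```

A real submission would of course replace this with a trained span-extraction model; the sketch is only meant to show the expected input/output shape of the task.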
*English FinCausal subtask*
- *Data Description: *The dataset has been sourced from various 2019
financial news articles provided by Qwam, along with additional SEC data
from the EDGAR database. Additionally, we have augmented the dataset from
FinCausal 2022, adding 500 new segments. Participants will be provided with
a sample of text blocks extracted from financial news and already labelled.
- *Scope: *The *English FinCausal subtask* focuses on detecting causes
and effects when the effects are quantified. The aim is to identify, in
a causal sentence or text block, the causal elements and the consequential
ones. Only one causal element and one effect are expected in each segment.
- *Length of data fragments: *The *English FinCausal subtask* segments
consist of up to three sentences.
- *Data format: *CSV files. Datasets for both the English and the
Spanish subtasks will be presented in the same format (a minimal,
hypothetical loading sketch follows this list).
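A minimal loading sketch, assuming a semicolon-delimited CSV with columns named Text, Cause and Effect; the actual column names, delimiter and file names are defined by the Starting Kit on CodaLab and may differ:

```python
import csv

def load_segments(path):
    """Yield (text, cause, effect) triples from a FinCausal-style CSV file.

    The column names ("Text", "Cause", "Effect") and the ';' delimiter are
    assumptions; check the Starting Kit for the authoritative format.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=";")
        for row in reader:
            yield row["Text"], row.get("Cause", ""), row.get("Effect", "")

# Example usage with a hypothetical file name:
# for text, cause, effect in load_segments("fincausal2023_train_en.csv"):
#     print(text, "| CAUSE:", cause, "| EFFECT:", effect)
```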
This shared task focuses on determining causality associated with a
quantified fact. An event is defined as the arising or emergence of a new
object or context with respect to a previous situation. The task will
therefore emphasise the detection of causality associated with the
transformation of financial objects embedded in quantified facts.
*Spanish FinCausal subtask*
- *Data Description: *The dataset has been sourced from a corpus of
Spanish financial annual reports from 2014 to 2018. Participants will be
provided with a sample of text blocks extracted from these reports,
labelled through inter-annotator agreement.
- *Scope: *The *Spanish FinCausal subtask* covers all types of
causes and effects, not necessarily limited to quantified effects. The
aim is to identify, in a paragraph, the causal elements and the
consequential ones. Only one causal element and one effect are expected in
each paragraph.
- *Length of Data fragments: *The *Spanish FinCausal subtask* involves
complete paragraphs.
- *Data format: *CSV files. Datasets for both the English and the
Spanish subtasks will be presented in the same format.
This shared task focuses on determining causality associated with both
events and quantified facts. For this task, a cause can be the justification
for a statement or the reason that explains a result. This task is also a
relation detection task.
Best regards,
FinCausal 2023 Team
Dear all,
We are looking for a Research Assistant/Associate to work on the project Automated Verification of Textual Claims (AVeriTeC), an ERC project led by Prof. Andreas Vlachos at the University of Cambridge starting in January 2024 or as soon as possible thereafter. The position is for two years. The successful candidate will be based in the Natural Language and Information Processing group (http://www.cl.cam.ac.uk/research/nl/) at the Department of Computer Science and Technology. The project focuses on developing approaches enabling the verification of highly complex claims, which require multiple pieces of evidence. Special focus will be paid to accompanying the verdicts with suitable justifications.
Candidates will have completed a Ph.D. (or be close to completing it) in a relevant field such as NLP, Information Retrieval, Artificial Intelligence or Machine Learning and be able to demonstrate a strong track record of independent research and high-quality publications. Essential skills include excellent programming (Python), NLP techniques, Machine Learning, and proven communication skills.
Appointment at research associate level is dependent on having a PhD or having equivalent skills and experience through non-academic routes. Where a PhD has yet to be awarded, the appointment will initially be made at research assistant level and amended to research associate when the PhD is awarded.
Enquiries concerning this position should be directed to Prof. Andreas Vlachos (av308(a)cam.ac.uk), and applicants are encouraged to contact him regarding the position.
Apply using this link:
https://www.jobs.cam.ac.uk/job/42366/
Thanks,
Andreas
Dear all,
We are organising a free hybrid event (online and in person): Language Data Analysis for Business and Professional Communication.
It will take place on 22 September 2023 10:00 - 15:30 UK time.
More details and registration: https://cass.lancs.ac.uk/mycalendar-events/?event_id1=3470
The ESRC Centre for Corpus Approaches to Social Science, Lancaster University, offers a practical training workshop focused on the computational analysis of language data for businesses, professional organisations, and anyone interested in communication in professional contexts. The data includes social media, newspapers, business reports, marketing materials and other data sources.
The workshop will introduce #LancsBox X <https://lancsbox.lancs.ac.uk/>, a new software tool developed at Lancaster University that can analyse and visualise large amounts of language data (millions and even billions of words). Practical examples of uses of #LancsBox X (case studies) will be provided.
Best,
Vaclav
Professor Vaclav Brezina
Professor in Corpus Linguistics
Department of Linguistics and English Language
ESRC Centre for Corpus Approaches to Social Science
Faculty of Arts and Social Sciences, Lancaster University
Lancaster, LA1 4YD
Office: County South, room C05
T: +44 (0)1524 510828
@vaclavbrezina
<http://www.lancaster.ac.uk/arts-and-social-sciences/about-us/people/vaclav-…>
Dear Colleagues,
I'm delighted to share that the 6th Workshop on Financial Technology and
Natural Language Processing (FinNLP) will be co-located with
IJCNLP-AACL-2023 (http://www.ijcnlp-aacl2023.org/) in Bali, Indonesia
(November 1–4).
The submission deadline for main track papers is *Sep. 8th, 2023*. For more
details, please refer to the website:
https://sites.google.com/nlg.csie.ntu.edu.tw/finnlp2023/home
We are continuing the *ML-ESG shared task* in FinNLP and sharing new labels
related to Impact Type Identification. You can also find more details about
the shared task on the FinNLP website. The Registration Form is ready:
https://forms.gle/j6gL5jy1upq5LrKY9
Participants in previous editions of FinNLP have shared various insights into
NLP for FinTech applications. Please refer to past years' proceedings for more details:
https://aclanthology.org/venues/finnlp/
Feel free to let us know if you have any questions. Looking forward to
seeing you at the 6th FinNLP.
Best Regards,
Chung-Chi
---
陳重吉 (Chung-Chi Chen), Ph.D.
Researcher
Artificial Intelligence Research Center, National Institute of Advanced
Industrial Science and Technology, Japan
E-mail: c.c.chen(a)acm.org <cjchen(a)nlg.csie.ntu.edu.tw>
Website: http://cjchen.nlpfin.com
Dear all,
Forwarding on behalf of the Arizona Linguistics Circle 17 (ALC 17)
<https://sites.google.com/view/arizonalinguisticscircle17/home?authuser=3>
committee:
We are happy to announce the 17th annual ALC conference, which will take
place over Halloweekend🎃 (October 27 to 29, 2023) at the University of Arizona
in Tucson, Arizona.
The theme for the conference is "Collaborative work in Linguistics:
Communities and Societies." We welcome graduate students, undergraduate
students, and language workers in communities to submit a proposal.
In the coming weeks we will announce our invited plenary speakers for ALC
17.
Please see below for the Call for Papers, which can also be found via this
EasyChair link: <https://easychair.org/cfp/ALC17>.
Arizona Linguistics Circle 17
Collaborative work in Linguistics: Communities and Societies
University of Arizona
Tucson, Arizona
October 27 to 29, 2023
Keynote Speakers: TBA
Call for Paper Submissions
We are pleased to invite talk proposals for Arizona Linguistics Circle 17
(ALC 17)
<https://sites.google.com/view/arizonalinguisticscircle17/home?authuser=3>.
ALC 17 is an annual graduate student-run conference held at the University
of Arizona. Our goal is to foster a deeper appreciation for linguistics
while providing a healthy environment for academic discussion, especially
as it concerns graduate student research.
The theme of this year’s conference is Collaborative work in Linguistics,
that is, the ways in which we have worked to adapt to the unique challenges
of a pandemic, found new ways to conduct research remotely, and made
ethical considerations in using different modalities. However, abstracts
from all areas of linguistics are welcome and encouraged.
This year, we plan to conduct a hybrid conference, with both in-person and
remote aspects (see the Covid contingency section below).
Abstract Guidelines:
- Abstracts may not exceed 500 words (not including keywords, references,
figures, and tableaux)
- Presentations will consist of 20-minute talks with 10-minute Q&A
- Authors are limited to one individual and one joint abstract (not
including workshop submissions)
- Only anonymized submissions will be accepted
- Abstract submission deadline: Friday, September 1, 2023
Abstract submission is via EasyChair <https://easychair.org/cfp/ALC17>. For
questions regarding abstract submission, please contact the abstract review
manager at azlingcircle17(a)gmail.com. Notification of acceptance will be
sent in late August.
Proceedings
Presenters will be invited to submit their paper for publication in Coyote
Papers <https://coyotepapers.sbs.arizona.edu/>, the conference proceedings
for ALC.
Call for Workshop Submissions
We are also inviting proposals for one- to two-hour long workshops on the
theme of Collaborative work in Linguistics within the realm of
experimentation, fieldwork, and data collection. Example workshops include,
but are not limited to, improving sound quality in remote fieldwork,
getting the most from online crowdsourcing or experimentation platforms,
setting up remote speech collection experiments, and more. Workshop
submissions from students are especially welcome.
Workshop abstracts may not exceed 500 words and must include a title,
detailed description (topic, format, length, and content), technical
prerequisites (hardware, software, or platforms), applicant's areas of
expertise, and why the workshop would be of interest to ALC attendees.
Workshop abstracts are submitted on EasyChair and do not count towards the
limit on paper abstract submissions. Authors of accepted workshops will be
offered modest honoraria.
Covid contingency
Our plan is to enjoy an in-person conference. However, we might consider a
hybrid conference, such that presenters will have the option to present
in-person or remotely. This way, both domestic and international presenters
will have the opportunity to deliver their presentations regardless of the
Covid situation or vaccine availability at their location. We will also
have a contingency plan to move the conference to a virtual format if
Arizona is undergoing a significant Covid outbreak at the time of the
conference.
Please let us know if you have any questions via email:
azlingcircle17(a)gmail.com. We look forward to sharing more information with
you over the coming months as we prepare for ALC 17.
Attentively,
Jesús E. González Franco (ALC 17 co-chair)
--
Dr Heather Froehlich
w // http://hfroehli.ch
t // @heatherfro
Hello,
We are pleased to announce that the University of Geneva will be hosting
the 3rd Symposium on Artificial Intelligence for Industry, Science, and
Society (AI2S2) from 11 to 15 September 2023, at Campus Biotech, Geneva. *Both
in-person and remote-only registration are free*; registration closes soon,
so we encourage you to register ASAP if you intend to participate.
We will be welcoming several renowned experts from each of the event
pillars (industry, science, and society), and have an exciting schedule
covering a wide variety of topics of interest to the community, including
keynote presentations from the following experts:
- *Alistair Knott*, Professor of AI; Victoria University of Wellington
- *Andrew Wyckoff*, Former Director for Science, Technology, and
Innovation; OECD
- *Ger Janssen*, Principal Scientist for AI & Data Science & Digital
Twin; Philips
- *Inma Martinez*, Chair of the Multi-Stakeholder Committee; GPAI
- *Juha Heikkilä*, International Advisor for AI; European Commission
- *Laure Soulier*, Professor HDR-ISIR Lab; Sorbonne University
- *Maria Girone*, Head of OpenLab; CERN
- *Michael Bronstein*, DeepMind Professor of AI; Oxford University
- *Philippe Limantour*, Chief Technology and Cybersecurity Officer;
Microsoft France
- *Stephen MacFeely*, Director of Data and Analytics; World Health
Organisation
You can find more information about the event at the following links:
- Main event website: https://ai2s2.org/2023/
- Event timetable/agenda: https://ai2s2.org/2023/event/
- Registration (in person): link
<https://indico.cern.ch/event/1288077/page/29800-general-registration>
- Registration (remote only): link
<https://indico.cern.ch/event/1288077/page/29802-remote-registration>
We hope to see many of you next month!
Best regards,
The AI2S2 2023 organisers
Hi all,
Announcing the ACM Multimedia 2023 Workshop on Generative AI and Multimodal LLM for Finance (GenAI4Finance).
We are looking forward to receiving exciting submissions on the broad theme of algorithms and applications of Generative AI (diffusion models, GANs, LLMs, Transformers, ChatGPT, open-source LLMs, etc.) in finance!
Paper deadline: Aug 10 '23
Website: https://generativeai4finance-mm23.github.io/
Papers accepted at the conference will be published in ACM proceedings and presented at ACM MM in Ottawa, Canada. Remote as well as in-person presentations are allowed!
Jointly organized by researchers from the University of Maryland, Georgia Institute of Technology, Adobe, MBZUAI (Mohamed bin Zayed University of Artificial Intelligence), and the Philipps University of Marburg.
CALL FOR PAPERS:
About The GenAI4Finance Workshop
Trading and investments in financial markets are useful mechanisms for wealth creation and poverty alleviation. Financial forecasting is an essential task that helps investors make sound investment decisions. It can assist users in better predicting price changes in stocks and currencies along with managing risk to reduce chances of adverse losses and bankruptcies. The world of finance is characterized by millions of people trading numerous assets in complex transactions, giving rise to an overwhelming amount of data that can indicate the future direction of stocks, equities, bonds, currencies, crypto coins, and non-fungible tokens (NFTs). Traditional statistical techniques have been extensively studied for stock market trading in developed economies. However, these techniques hold less relevance with rising automation in trading, the interconnected nature of the economies, and the advent of high-risk, high-reward cryptocurrencies and decentralized financial products like NFTs. Today, the world of finance has been brought ever closer to technology as technological innovations become a core part of financial systems. Hence, there is a need to systematically analyze how advances in Generative Artificial Intelligence and Large Language Modeling can help citizen investors, professional investment companies and novice enthusiasts better understand the financial markets.
Relevance Of Multimodal LLMs And Generative AI In Finance
The biggest hurdle in processing unstructured data through LLMs like ChatGPT, GPT-3.5, Multimodal GPT-4, PaLM, and others is their opaque nature, as they learn from large, internet-sized corpora. A lack of understanding of what, when, and how these models can be utilized for financial applications remains an unsolved problem. While these models have shown tremendous success in NLP and document processing tasks, their grasp of financial knowledge has been studied only to a limited extent. The difficulty these LLMs have in automatically capturing real-time stock market information impedes their successful application for traders and finance practitioners. The learning abilities of LLMs on streaming news sentiment expressed through different multimedia have barely been touched upon by researchers in finance.
Artificial Intelligence-based methods can help bridge the gap between the two domains through intersectional studies. Even though there have been scattered studies addressing these challenges, there is a lack of a unified forum focused on the umbrella theme of generative AI + multimodal language modeling for financial forecasting. We expect this workshop to attract researchers from various perspectives to foster research at the intersection of multimodal AI and financial forecasting. We also plan to organize a shared task to encourage practitioners and researchers to gain a deeper understanding of the problems and come up with novel solutions.
Topics:
This workshop will hold a research track and a shared task. The research track aims to explore recent advances and challenges of Multimodal Generative LLMs for finance. Researchers from artificial intelligence, computer vision, speech processing, natural language processing, data mining, statistics, optimization, and other fields are invited to submit papers on recent advances, resources, tools, and challenges on the broad theme of Multimodal AI for finance. The topics of the workshop include but are not limited to the following:
Generative AI applications in finance
LLM evaluation for finance
Fine-tuning LLMs for finance
Pre-training strategies for Financial LLMs
Retrieval augmentations for interpretable finance
Hallucinations in Financial LLMs and Generative AI
Machine Learning for Time Series Data
Audio-Video Processing for Financial Unstructured Data
Processing text transcript with audio/video for audio-visual-textual alignment, information extraction, salient event detection, entity linking
Conversational dialogue modeling using LLMs
Natural Language Processing Applications in Finance
Named-entity recognition, relationship extraction, ontology learning in financial documents
Multimodal financial knowledge discovery
Data construction, acquisition, augmentation, feature engineering, and analysis for investment and risk management
Bias analysis and mitigation in financial LLMs and datasets
Interpretability and explainability for financial Generative AI models and Multimodal LLMs
Privacy-preserving AI for finance
Video understanding of financial multimedia
Vision + language and/or other modalities
Time series modeling for financial applications
Important Dates
- June 1, 2023: ACM MM 2023 Workshop Website Open
- August 10, 2023: Submission Deadline
- August 15, 2023: Paper notification
- August 18, 2023: Camera-Ready Deadline
All deadlines are “anywhere on earth” (UTC-12)
Submission:
Authors are invited to submit their unpublished work that represents novel research. The papers should be written in English using the ACM Multimedia 2023 author kit and follow the ACM Multimedia 2023 formatting guidelines. Authors can also submit supplementary materials, including technical appendices, source code, datasets, and multimedia appendices. All submissions, including the main paper and its supplementary materials, should be fully anonymized. For more information on formatting and anonymity guidelines, please refer to the ACM Multimedia 2023 call for papers page.
Reviewing: At least two reviewers with matching technical expertise will review each paper.
Each paper should be accompanied by one workshop registration.
Submitting the same paper to multiple ACM Multimedia 2023 workshops is forbidden.
All papers will be double-blind peer-reviewed. The workshop accepts both long papers and short papers:
Short Paper: Up to 4 pages of content, plus unlimited pages for references and appendix. Upon acceptance, the authors are provided with 1 more page to address the reviewers' comments.
Long Paper: Up to 8 pages of content, plus unlimited pages for references and appendix. Upon acceptance, the authors are provided with 1 more page to address the reviewers' comments.
Two reviewers with matching technical expertise will review each paper. Authors of the accepted papers will present their work in either the Oral or Poster session. All accepted papers will appear in the workshop proceedings, to be published in the ACM Multimedia 2023 proceedings. The authors will keep the copyright of their published papers.
The paper must be submitted using EasyChair (https://easychair.org/my/conference?conf=genai4finance).
In addition to advertising a permanent postdoctoral research position (see arg.tech/2023PDRA01 <http://arg.tech/2023PDRA01>, deadline 14 August 2023) the Centre for Argument Technology is now also opening a call for applications for fully-funded PhD studentships. Applications are invited across the full range of interdisciplinary research in the Centre, with more details available at arg.tech/2023PHD01 <http://arg.tech/2023PHD01>. Closing date for the PhD studentships is 23 September 2023.
The University of Dundee is a registered Scottish Charity, No: SC015096
[resending since it looked like my reply bounced due to size of thread]
---------- Forwarded message ---------
From: Ada Wan <adawan919(a)gmail.com>
Date: Mon, Aug 7, 2023 at 3:46 PM
Subject: Re: [Corpora-List] Re: Any literature about tensors-based corpora
NLP research with actual examples (and homework ;-)) you would suggest? ...
To: Anil Singh <anil.phdcl(a)gmail.com>
Cc: Hesham Haroon <heshamharoon19(a)gmail.com>, corpora <
corpora(a)list.elra.info>
Hi Anil
Just a couple of quick points for now:
1. re encoding etc.:
I know how involved and messy these "encoding" and "clean-up" issues
are/were and could be for the non-ASCII world. I wish people who work on
multilingual NLP and/or language technologies would give these issues more
attention.
[Iiuc, these intros cover the standardized bit from Unicode (also good for
other viewers at home, I mean, readers of this thread :) ):
https://www.unicode.org/notes/tn10/indic-overview.pdf
https://www.unicode.org/faq/indic.html]
Re "there is still no universally accepted way to actually use these input
methods" and "Most Indians, for example, on social media and even in emails
type in some pseudo-phonetic way using Roman letters with the QWERTY
keyboard":
actually this is true in many other parts of the non-ASCII world. Even in
DE and FR, one has "gotten used to" circumventing the ASCII-conventions
with alternate spellings, e.g. spelling out umlauts, dropping the
accents.... Between i) getting people to just use the script the "standardized
way" and ii) weaning people from wordhood --- I wonder which would be
harder. :) But I regard i) as a vivid showcase of human creativity; I don't
think there is anything to "correct". As with ii), it could be interpreted
as baggage accompanying "ASCII-centrism". Language is (re-)productive in
nature anyway, there will always be novel elements or ways of expression
that could be ahead of standardization. I don't think we should dictate how
people write or express themselves digitally (by asking them to adhere to
standards or grammar), but, instead, we (as technologists) should use good
methods and technologies to deal with the data that we have. (N.B. one
doesn't write in "words". One just writes. "Word count" is arbitrary, but
character count is not --- unless one is in charge of the
lower-level/technical design of such.)
For text-based analyses with "non-standard varieties or input
idiosyncrasies", I suppose one would just have to have good methods in
order to extract information from these (if so wished).
(Btw, iiuc, you are also of the opinion that there isn't much/anything else
to work on as far as character encoding for scripts usually used in India
or standardization on this front is concerned, correct? If not, please let
it be known here. I bet/hope some Unicode folks might be on this thread?)
2. Re word and morphology:
I suppose what I am advocating is that not only is there really no word
(whatever that means etc.), there is also no morphology --- not as
something intrinsic in language nor as a science. I understand that this
can be hard to accept for those of us who have had a background in
Linguistics or many language-related areas, because morphology has been
part of an academic discipline for a while. But the point is that
morphological analyses are really just some combination of post-hoc
analyses based on traditional philological practices and preferences (i.e.
ones with selection bias). It is not a universal decomposition method.
3. Re "but ... Sometimes it is wise to be silent":
I'm afraid I'd have to disagree with you here. Before my experiences these
past years, I had never wanted to be "famous"/"infamous". I had just wanted
to work on my research.
But in the course of these few difficult years, I realized part of the
"complexity" lay in something that might have made my whole life and
my "(un-)career"/"atypical career path" difficult --- our language
interpretation and our community's/communities' attitudes towards language
(again, whatever "language" means!).
4. Re grammar:
much of my "'no grammar' campaign" comes from my observation that many
people seem to have lost it with language. The "word"/"grammar"-hacking
phenomena, esp. in Linguistics/CL/NLP, exacerbate, on the one hand, our
dependency on grammar and, on the other hand, that on words --- when neither
of these is real or necessary. And then the more I see these people
advertising their work on social media, the more likely it is for them to
misinform and influence others (may these be the general public or computer
scientists who do not have much of a background in language or in the
science of language), warping their relationship with language. (The results
of my resolving
"morphological complexity" in "Fairness in Representation" could have been
hard to experience for some.)
Re "I don't say it is necessary, but I see it as one possible model to
describe language":
Ok, to this "grammar is useful" argument:
esp. considering how grammar and linguistics (as a subject) both suffered
from selection bias, and both have some intrinsic problems in their
approach with "judgments", even if such models might work sometimes in
describing some data, I don't think it'd be ethical to continue
pursuing/developing grammatical methods.
When dealing with texts, character n-grams (based on actual data!!)** and
their probabilities should suffice as a good description --- insofar as one
actually provides a description. There is no need to call it "grammar". It
just invokes tons of unnecessary dependencies.
**And when over time and across datasets, one is able to generalize truths
(may these be for
language/communication/information/computation/mathematics/...) from these
statistical accounts that hold, then there is a potential for good
theories. But as of now, even in Linguistics, there is still a tendency to
cherry-pick, in e.g. discarding/disregarding whitespaces. And esp.
considering how little data (as in, so few good/verified datasets) we have
in general, there is still quite some work to do for all when it comes to
good data scientific theories.
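To make the character n-gram point concrete, here is a toy sketch; the sample string and the choice of n = 3 are arbitrary and not tied to any particular dataset:

```python
from collections import Counter

def char_ngram_probs(text, n=3):
    """Relative frequencies of character n-grams; whitespace is kept as-is."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Arbitrary example string, not drawn from any corpus:
probs = char_ngram_probs("language is (re-)productive in nature")
for gram, p in sorted(probs.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(gram), round(p, 3))
```

Nothing here requires a notion of "word"; whitespace is just another character.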
Re "which can be useful for some -- like educational -- purposes if used in
the right way":
how grammar has been too heavily used in education is more of an ailment
than a remedy or healthy direction.
I think the only way for grammar to survive is to regard it as "style
guides".
Re "particularly with non-native speakers of English ... sometimes your
patience is severely tested":
to that: yes, I understand/empathize, but no --- in that we ought to change
how writing/language is perceived. Depending on the venue/publication, I
think sometimes we ought to relax a bit with others' stylistic performance.
The "non-native speakers of X" framing has been a plague in Linguistics for a
while now. We have almost paved our way to some next-gen eugenics with that.
Most of us on this mailing list are not submitting to non-scientific literary
publications; I'd prefer better science and content to better writing
styles at any time.
Re "magic":
I think sometimes I do get some "magical" results with higher-dimensional
models. So there is some "magic" (or ((math/truths)+data)? :) ) that is not
so obvious. That, I am always glad to look further into.
Best
Ada
On Sat, Aug 5, 2023 at 6:26 PM Anil Singh <anil.phdcl(a)gmail.com> wrote:
> On Sat, Aug 5, 2023 at 6:56 PM Ada Wan <adawan919(a)gmail.com> wrote:
>
>> Hi Anil
>>
>> Thanks for your comments. (And thanks for reading my work.)
>>
>> Yeah, there is a lot that one has to pay attention to when it comes to
>> what "textual computing" entails (and to which extent it "exists"). Beyond
>> "grammar" definitely. But experienced CL folks should know that. (Is this
>> you btw: https://scholar.google.com/citations?user=QKnpUbgAAAAJ?
>>
>
> Yes, that's me, for the better or for the worse.
>
>
>> If not, do you have a webpage for your work? Nice to e-meet you either
>> way!)
>>
>>
> Thank you.
>
> Re "I know first hand the problems in doing NLP for low resource languages
>> which are related to text encodings":
>> which specific languages/varieties are you referring to here? If the
>> issue lies in the script not having been encoded, one can contact SEI about
>> it (https://linguistics.berkeley.edu/sei/)? I'm always interested in
>> knowing what hasn't been encoded. Are the scripts on this list (
>> https://linguistics.berkeley.edu/sei/scripts-not-encoded.html)?
>>
>>
> Well, that's a long story. It is related to the history of adaptation of
> computers by the public at large in India. The really difficult part is not
> about scripts being encoded. Instead, it is about a script being
> over-encoded or encoded in a non-standard way. And the lack of adoption of
> standard encodings and input methods. Just to give one example, even though
> a single encoding (called ISCII) for all Brahmi-origin scripts of India was
> created officially, most people were unaware of it or didn't use it for so
> many reasons. One major reason being that it was not supported on Operating
> Systems, including Windows (which was anyway developed many years after
> creation of ISCII). Input methods and rendering engines for it were not
> available. You had to have a special terminal to use it, but that was a
> text-only terminal, used mainly in research centers and perhaps for some very
> limited official purposes. And computers, as far as the general public was
> concerned, were most commonly used for DeskTop Publishing (which became
> part of Indian languages as "DTP"). These non-standard encodings were
> mainly font-encodings, just to enable proper rendering of text for
> publishing. One of the most popular 'encodings' was based on the Remington
> typewriter for Hindi. Another was based on mostly phonetic mapping from
> Roman to Devanagari. Other languages which did not use Devanagari also had
> their own non-standard encodings, often multiple encodings. The reason
> these became popular was that they enabled people to type in Indian
> languages and see the text rendered properly, since no other option was
> available and they understandably didn't really care about it being
> standard or not. It wasn't until recently that Indic scripts were properly
> supported by any OS's. It is possible that even now, when Unicode is
> supported on most OS's and input methods are available as part of OS's,
> there are people still using non-standard encodings. Even now, you can come
> across problems related to either input methods or rendering for Indic
> scripts on OS's. And most importantly, there is still no universally
> accepted way to actually use these input methods. Most Indians, for
> example, on social media and even in emails type in some pseudo-phonetic
> way using Roman letters with the QWERTY keyboard. Typing in Indian
> languages using Indic scripts is still a specialized skill.
>
> The result of all this is that when you try to collect data for low
> resource languages, including major languages of India, there may be a lot
> of data -- or perhaps even all the data, depending on the language -- which
> is in some non-standard ad-hoc encoding which has non-trivial mapping with
> Unicode. This is difficult partly because non-standard encodings are often
> based on glyphs, rather than actual units of the script. So, to be able to
> use it you need a perfect encoding converter to get the text in Unicode
> (UTF-8). Such converters have been there for a long time, but since they
> were difficult to create, they were/are proprietary and not available even
> to researchers in most cases. It seems a pretty good OCR system has been
> developed for Indic scripts/languages, but I have not yet had the chance to
> try it.
>
> For example, I am currently (for the last few years) working on Bhojpuri,
> Magahi and Maithili. When we tried to collect data for these languages,
> there was the same problem, which is actually not really a problem for the
> general public because their purpose is served by these non-standard
> encodings, but for NLP/CL you face difficulty in getting the data in a
> usable form.
>
> This is just a brief overview and I also don't really know the full extent
> of it, in the sense that I don't have a comprehensive list of such
> non-standard encodings for all Indic scripts.
>
>
>> Re the unpublished paper (on a computational typology of writing
>> systems?):
>> when and to where (as in, which venues/publications) did you submit it?
>> I remember one of my first term papers from the 90s being on the
>> phonological system of written Cantonese (or sth like that --- don't
>> remember my wild days), the prof told me it wasn't "exactly linguistics"...
>>
>> I had submitted to the journal Written Language and Literacy in 2009. It
> was actually mostly my mistake that I didn't submit a revised version of
> the paper as I was going through a difficult period then.
>
>
>> Re "on building an encoding converter that will work for all 'encodings'
>> used for Indian languages":
>> this sounds interesting!
>>
>>
> Yes, I still sometimes wish I could build it.
>
>
>> Re "I too wish there was a good comprehensive history text encodings,
>> including non-standard ad-hoc encodings":
>> what do you mean by that --- history of text encodings or historical text
>> encodings?
>> After my discoveries from recent years, when my "mental model" towards
>> what's been practiced in the language space (esp. in CL/NLP) finally
>> *completely *shifted, I had wanted to host (or co-host) a tutorial on
>> character encoding for those who might be under-informed on the matter
>> (including but not limited to the "grammaroholics" (esp. the CL/NLP
>> practitioners who seem to be stuck doing grammar, even in the context of
>> computing) --- there are so many of them! :) )
>>
>>
> I mostly meant the non-standard 'encodings' (really just ad-hoc mappings)
> to serve someone's current purpose. To fully understand the situation, you
> have to be familiar with social-political-economic-etc. aspects of the
> language situation in India.
>
>
>> Re "word level language identification":
>> I don't do "words" anymore. In that 2016 TBLID paper of mine, I
>> (regrettably) was still going with the flow in under-reporting on
>> tokenization procedures (like what many "cool" ML papers did). But "words"
>> do certainly shape the results! I'm really looking forward to everyone working with
>> full-vocabulary, pure character or byte formats (depending on the task),
>> while being 100% aware of statistics. Things can be much more transparent
>> and easily replicable/reproducible that way anyway.
>>
>>
> Well, I used the word 'word' as just a shorthand for space separated
> segments. In my PhD thesis, I had also argued against word being the unit
> of computational processing or whatever you call it. I had called the unit
> Extra-Lexical Unit, consisting of a core morpheme and inflectional parts. I
> realize now that even that may not necessarily work for languages with
> highly fusional morphology. But, something like this is now the preferred
> unit of morphological processing, as in the CoNLL shared tasks and
> UniMorph. I also realize that I could not have been the first to come to
> this conclusion.
>
>
>> Re "We have to be tolerant of what you call bad research for various
>> unavoidable reasons. Research is not what it used to be":
>> No, I think one should just call out bad research and stop doing it. I
>> wouldn't want students to burn the midnight oil working hard for nothing.
>> Bad research also warps expectations and standards, in other sectors as
>> well (education, healthcare, commerce... etc.). Science, as in the pursuit
>> of truth and clarity, is and should be the number 1 priority of any decent
>> research. (In my opinion, market research or research for marketing
>> purposes should be all consolidated into one track/venue if they lack
>> scientific quality.) I agree research is not what it used to be --- but in
>> the sense that the quality is much worse in general, much hacking around
>> with minor, incremental improvements. Like in the case of "textual
>> computing", people are "grammar"-hacking.
>>
>>
> I completely agree with you, but ... Sometimes it is wise to be silent.
>
>
>> Re "better ... gender representation":
>> hhmm... I'm not so sure about that.
>>
>>
> You are a better judge of that. I just shared my opinion, which may not be
> completely free from bias, although I do try.
>
>
>> Re "About grammar, I have come to think of it as a kind of language model
>> for describing some linguistic phenomenon":
>> nah, grammar not necessary.
>>
>>
> I don't say it is necessary, but I see it as one possible model to
> describe language, which can be useful for some -- like educational --
> purposes if used in the right way.
>
>
>> Re grammaroholic reviewers:
>> yeah, there are tons of those in the CL/NLP space. I think many of them
>> are only willing and/or able to critique grammar. What's explicit is that it
>> shows that they don't want to check one's math and code --- besides, when
>> most work on "words" anyway, there is a limit to how things are
>> replicable/reproducible, esp. if on a different dataset. The implicit bit,
>> however, is that I think there is some latent intent to introduce/reinforce
>> the influence of "grammar" into the computing space. That, I do not agree
>> with at all.
>>
>>
> I should confess that I sometimes am guilty of that (pointing out
> grammatical mistakes) myself. However, the situation is complicated in
> countries like India due to historical and other reasons. I think the
> papers should at least be in a condition that they can be understood
> roughly as intended. This may not always be the case, particularly with
> non-native speakers of English, or people who are not yet speakers/writers
> of English at all. Now, perhaps no one knows better than myself that it is
> not really their fault completely, but as a reviewer, sometimes your
> patience is severely tested.
>
>
>> Re "magic":
>> yes, once one gets over the hype, it's just work.
>>
>>
> True, but what I said is based on where I am coming from (as in, to this
> position), which will take a really long time to explain. Of course, I
> don't literally mean magic.
>
> Re "I have no experience of field work at all and that I regret, but it is
>> partly because I am not a social creature":
>> one can be doing implicit and unofficial "fieldwork" everyday if one pays
>> attention to how language is used.
>>
>>
> That indeed I do all the time. I meant official fieldwork.
>