August 2023 - Corpora

3rd Artificial Intelligence for Industry, Science, and Society: Geneva, 11 to 15 September 2023
by Douglas Teodoro 09 Aug '23

09 Aug '23

Hello, We are pleased to announce that the University of Geneva will be hosting the 3rd Symposium on Artificial Intelligence for Industry, Science, and Society (AI2S2) from 11 to 15 September 2023, at Campus Biotech, Geneva. *Both in-person and remote-only registration are free*; registration closes soon, thus we encourage you to register ASAP if you intend to participate. We will be welcoming several renowned experts from each of the event pillars (industry, science, and society), and have an exciting schedule covering a wide variety of topics of interest to the community, including keynote presentations from the following experts: - *Alistair Knott*, Professor of AI; Victoria University of Wellington - *Andrew Wyckoff*, Former Director for Science, Technology, and Innovation; OECD - *Ger Janssen*, Principal Scientist for AI & Data Science & Digital Twin; Philips - *I**nma Martinez*, Chair of the Multi-Stakeholder Committee; GPAI - *Juha Heikkilä*, International Advisor for AI; European Commission - *Laure Soulier*, Professor HDR-ISIR Lab; Sorbonne University - *Maria Girone*, Head of OpenLab; CERN - *Michael Bronstein*, DeepMind Professor of AI; Oxford University - *Philippe Limantour*, Chief Technology and Cybersecurity Officer; Microsoft France - *Stephen MacFeely*, Director of Data and Analytics; World Health Organisation You can find more information about the event at the following links: - Main event website: https://ai2s2.org/2023/ - Event timetable/agenda: https://ai2s2.org/2023/event/ - Registration (in person): link <https://indico.cern.ch/event/1288077/page/29800-general-registration> - Registration (remote only): link <https://indico.cern.ch/event/1288077/page/29802-remote-registration> We hope to see many of you next month! Best regards, The AI2S2 2023 organisers

1 0

Call for Submission: ACM Multimedia 2023 Workshop On Generative AI And Multimodal LLM For Finance (GenAI4Finance)
by puneetm＠umd.edu 09 Aug '23

09 Aug '23

Hi all, Announcing the ACM Multimedia 2023 Workshop On Generative AI And Multimodal LLM For Finance (GenAI4Finance). Looking forward to receiving exciting submissions on the broad theme of algorithms and applications of Generative AI (Diffusion Models, GANS, LLMs, Transformers, ChatGPT, Opensource LLMs, etc) in Finance! Paper deadline: Aug 10 '23 Website: https://generativeai4finance-mm23.github.io/ Papers accepted at the conference will be published in ACM proceedings and presented at ACM MM in Ottawa, Canada. Remote as well as in-person presentations are allowed! Jointly organized by researchers from the University of Maryland Georgia Institute of Technology Adobe MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) and The Philipp University of Marburg. CALL FOR PAPERS: About The GenAI4Finance Workshop Trading and investments in financial markets are useful mechanisms for wealth creation and poverty alleviation. Financial forecasting is an essential task that helps investors make sound investment decisions. It can assist users in better predicting price changes in stocks and currencies along with managing risk to reduce chances of adverse losses and bankruptcies. The world of finance is characterized by millions of people trading numerous assets in complex transactions, giving rise to an overwhelming amount of data that can indicate the future direction of stocks, equities, bonds, currencies, crypto coins, and non-fungible tokens (NFTs). Traditional statistical techniques have been extensively studied for stock market trading in developed economies. However, these techniques hold less relevance with rising automation in trading, the interconnected nature of the economies, and the advent of high-risk, high-reward cryptocurrencies and decentralized financial products like NFTs. Today, the world of finance has been brought ever closer to technology as technological innovations become a core part of financial systems. Hence, there is a need to systematically analyze how advances in Generative Artificial Intelligence and Large Language Modeling can help citizen investors, professional investment companies and novice enthusiasts better understand the financial markets. Relevance Of Multimodal LLMs And Generative AI In Finance The biggest hurdle in processing unstructured data through LLMs like ChatGPT, GPT-3.5, Multimodal GPT-4, PaLM, and others is their opaque nature that they learn from large internet-sized corpora. Lack of understanding of what, when, and how these models can be utilized for financial applications remains an unsolved problem. While these models have shown tremendous success in NLP and document processing tasks, their study on financial knowledge has been very limited. The difficulty in automatically capturing real-time stock market information by these LLMs impedes their successful application for traders and finance practitioners. The learning abilities of LLMs on streaming news sentiment expressed through different multimedia has been barely touched upon by researchers in finance. Artificial Intelligence based methods can help bridge that gap between the two domains through inter-sectional studies. Even though there have been scattered studies in understanding these challenges, there is a lack of a unified forum to focus on the umbrella theme of generative AI + multimodal language modeling for financial forecasting. We expect this workshop to attract researchers from various perspectives to foster research on the intersection of multimodal AI and financial forecasting. We also plan to organize one shared task to encourage practitioners and researchers to gain a deeper understanding of the problems and come up with novel solutions. Topics: This workshop will hold a research track and a shared task. The research track aims to explore recent advances and challenges of Multimodal Generative LLMs for finance. Researchers from artificial intelligence, computer vision, speech processing, natural language processing, data mining, statistics, optimization, and other fields are invited to submit papers on recent advances, resources, tools, and challenges on the broad theme of Multimodal AI for finance. The topics of the workshop include but are not limited to the following: Generative AI applications in finance LLM evaluation for finance Fine-tuning LLMs for finance Pre-training strategies for Financial LLMs Retrieval augmentations for interpretable finance Hallucinations in Financial LLMs and Generative AI Machine Learning for Time Series Data Audio-Video Processing for Financial Unstructured Data Processing text transcript with audio/video for audio-visual-textual alignment, information extraction, salient event detection, entity linking Conversational dialogue modeling using LLMs Natural Language Processing Applications in Finance Named-entity recognition, relationship extraction, ontology learning in financial documents Multimodal financial knowledge discovery Data construction, acquisition, augmentation, feature engineering, and analysis for investment and risk management Bias analysis and mitigation in financial LLMs and datasets Interpretability and explainability for financial Generative AI models and Multimodal LLMs Privacy-preserving AI for finance Video understanding of financial multimedia Vision + language and/or other modalities Time series modeling for financial applications Important Dates - June 1, 2023: ACM MM 2023 Workshop Website Open - August 10, 2023: Submission Deadline - August 15, 2023: Paper notification - August 18, 2023: Camera-Ready Deadline All deadlines are “anywhere on earth” (UTC-12) Submission: Authors are invited to submit their unpublished work that represents novel research. The papers should be written in English using the ACM Multimedia-23 author kit and follow the ACM Multimedia 2023 formatting guidelines. Authors can also submit supplementary materials, including technical appendices, source codes, datasets, and multimedia appendices. All submissions, including the main paper and its supplementary materials, should be fully anonymized. For more information on formatting and anonymity guidelines, please refer to ACM Multimedia 2023 call for papers page. Reviewing: At least two reviewers with matching technical expertise will review each paper. Each paper should be accompanied by one workshop registration. Multiple submissions of the same paper to more ACM Multimedia-2023 workshops are forbidden. All papers will be double-blind peer-reviewed. The workshop accepts both long papers and short papers: Short Paper: Up to 4 pages of content, plus unlimited pages for references and appendix. Upon acceptance, the authors are provided with 1 more page to address the reviewer's comments. Long Paper: Up to 8 pages of content, plus unlimited pages for references and appendix. Upon acceptance, the authors are provided with 1 more page to address the reviewer's comments. Two reviewers with the same technical expertise will review each paper. Authors of the accepted papers will present their work in either the Oral or Poster session. All accepted papers will appear in the workshop proceedings to be published in ACM Multimedia 2023 proceedings. The authors will keep the copyright of their published papers. The paper must be submitted using EasyChair (https://easychair.org/my/conference?conf=genai4finance).

1 0

[JOBS] PhD and permanent Postdoc positions in Argumentation
by Jacky Visser (Staff) 08 Aug '23

08 Aug '23

In addition to advertising a permanent postdoctoral research position (see arg.tech/2023PDRA01<http://arg.tech/2023PDRA01>, deadline 14 August 2023) the Centre for Argument Technology is now also opening a call for applications for fully-funded PhD studentships. Applications are invited across the full range of interdisciplinary research in the Centre, with more details available at arg.tech/2023PHD01<http://arg.tech/2023PHD01>. Closing date for the PhD studentships is 23 September 2023. The University of Dundee is a registered Scottish Charity, No: SC015096

1 0

math and language/text data (continued from thread "Re: Re: Any literature about tensors-based corpora NLP research with actual examples (and homework ;-)) you would suggest? ...")
by Ada Wan 08 Aug '23

08 Aug '23

[resending since it looked like my reply bounced due to size of thread] ---------- Forwarded message --------- From: Ada Wan <adawan919(a)gmail.com> Date: Mon, Aug 7, 2023 at 3:46 PM Subject: Re: [Corpora-List] Re: Any literature about tensors-based corpora NLP research with actual examples (and homework ;-)) you would suggest? ... To: Anil Singh <anil.phdcl(a)gmail.com> Cc: Hesham Haroon <heshamharoon19(a)gmail.com>, corpora < corpora(a)list.elra.info> Hi Anil Just a couple of quick points for now: 1. re encoding etc.: I know how involved and messy this "encoding" and "clean-up" issues are/were and could be for the non-ASCII world. I wish people who work on multilngual NLP and/or language technologies would give these issues more attention. [Iiuc, these intros cover the standardized bit from Unicode (also good for other viewers at home, I mean, readers of this thread :) ): https://www.unicode.org/notes/tn10/indic-overview.pdf https://www.unicode.org/faq/indic.html] Re "there is still no universally accepted way to actually use these input methods" and "Most Indians, for example, on social media and even in emails type in some pseudo-phonetic way using Roman letters with the QWERTY keyboard": actually this is true in many other parts of the non-ASCII world. Even in DE and FR, one has "gotten used to" circumventing the ASCII-conventions with alternate spellings, e.g. spelling out umlauts, dropping the accents.... Between i) getting people to just use script the "standardized way" and ii) weaning people from wordhood --- I wonder which would be harder. :) But I regard i) as a vivid showcase of human creativity, I don't think there is anything to "correct". As with ii), it could be interpreted as baggage accompanying "ASCII-centrism". Language is (re-)productive in nature anyway, there will always be novel elements or ways of expression that could be ahead of standardization. I don't think we should dictate how people write or express themselves digitally (by asking them to adhere to standards or grammar), but, instead, we (as technologists) should use good methods and technologies to deal with the data that we have. (N.B. one doesn't write in "words". One just writes. "Word count" is arbitrary, but character count is not --- unless one is in charge of the lower-level/technical design of such.) For text-based analyses with "non-standard varieties or input idiosyncrasies", I suppose one would just have to have good methods in order to extract information from these (if so wished). (Btw, iiuc, you are also of the opinion that there isn't much/anything else to work on as far as character encoding for scripts usually used in India or standardization on this front is concerned, correct? If not, please let it be known here. I bet/hope some Unicode folks might be on this thread?) 2. Re word and morphology: I suppose what I am advocating is that not only is there really no word (whatever that means etc.), there is also no morphology --- not as something intrinsic in language nor as a science. I understand that this can be hard to accept for those of us who have had a background in Linguistics or many language-related areas, because morphology has been part of an academic discipline for a while. But the point is that morphological analyses are really just some combination of post-hoc analyses based on traditional philological practices and preferences (i.e. ones with selection bias). It is not a universal decomposition method. 3. Re "but ... Sometimes it is wise to be silent": I'm afraid I'd have to disagree with you here. Before my experiences these past years, I had never wanted to be "famous"/"infamous". I had just wanted to work on my research. But in the course of these few difficult years, I realized part of the "complexity" lied in something that might have had made my whole life and my "(un-)career"/"atypical career path" difficult --- our language interpretation and our community's/communities' attitudes towards language (again, whatever "language" means!). 4. Re grammar: much of my "'no grammar' campaign" comes from my observation that many people seem to have lost it with language. The "word"/"grammar"-hacking phenomena, esp. in Linguistics/CL/NLP, exacerbate, otoh, our dependency on grammar, otoh, that on words --- when neither of these is real or necessary. And then the more I see these people advertising their work on social media, the more likely it is for them to misinform and influence others (may these be the general public or computer scientists who do not have much of a background in language or in the science of language), warping the their relationship with language. (The results of my resolving "morphological complexity" in "Fairness in Representation" could have been hard to experience for some.) Re "I don't say it is necessary, but I see it as one possible model to describe language": Ok, to this "grammar is useful" argument: esp. considering how grammar and linguistics (as a subject) both suffered from selection bias, and both have some intrinsic problems in their approach with "judgments", even if such models might work sometimes in describing some data, I don't think it'd be ethical to continue pursuing/developing grammatical methods. When dealing with texts, character n-grams (based on actual data!!)** and their probabilities should suffice as a good description --- insofar as one actually provides a description. There is no need to call it "grammar". It just invokes tons of unnecessary dependencies. **And when over time and across datasets, one is able to generalize truths (may these be for language/communication/information/computation/mathematics/...) from these statistical accounts that hold, then there is a potential for good theories. But as of now, even in Linguistics, there is still a tendency to cherry-pick, in e.g. discarding/disregarding whitespaces. And esp. considering how little data (as in, so few good/verified datasets) we have in general, there is still quite some work to do for all when it comes to good data scientific theories. Re "which can be useful for some -- like educational -- purposes if used in the right way": how grammar has been too heavily used in education is more of an ailment than a remedy or healthy direction. I think the only way for grammar to survive is to regard it as "style guides". Re "particularly with non-native speakers of English ... sometimes your patience is severely tested": to that: yes, I understand/empathize, but no --- in that we ought to change how writing/language is perceived. Depending on the venue/publication, I think sometimes we ought to relax a bit with other's stylistic performance. The "non-native speakers of X" has been a plague in Linguistics for a while now. We have almost paved our way to some next-gen eugenics with that. Most of us on this mailing list are not submitting to non-scientific literary publications, I'd prefer better science and content to better writing styles at any time. Re "magic": I think sometimes I do get some "magical" results with higher-dimensional models. So there is some "magic" (or ((math/truths)+data)? :) ) that is not so obvious. That, I am always glad to look further into. Best Ada On Sat, Aug 5, 2023 at 6:26 PM Anil Singh <anil.phdcl(a)gmail.com> wrote: > On Sat, Aug 5, 2023 at 6:56 PM Ada Wan <adawan919(a)gmail.com> wrote: > >> Hi Anil >> >> Thanks for your comments. (And thanks for reading my work.) >> >> Yeah, there is a lot that one has to pay attention to when it comes to >> what "textual computing" entails (and to which extent it "exists"). Beyond >> "grammar" definitely. But experienced CL folks should know that. (Is this >> you btw: https://scholar.google.com/citations?user=QKnpUbgAAAAJ? >> > > Yes, that's me, for the better or for the worse. > > >> If not, do you have a webpage for your work? Nice to e-meet you either >> way!) >> >> > Thank you. > > Re "I know first hand the problems in doing NLP for low resource languages >> which are related to text encodings": >> which specific languages/varieties are you referring to here? If the >> issue lies in the script not having been encoded, one can contact SEI about >> it (https://linguistics.berkeley.edu/sei/)? I'm always interested in >> knowing what hasn't been encoded. Are the scripts on this list ( >> https://linguistics.berkeley.edu/sei/scripts-not-encoded.html)? >> >> > Well, that's a long story. It is related to the history of adaptation of > computers by the public at large in India. The really difficult part is not > about scripts being encoded. Instead, it is about a script being > over-encoded or encoded in a non-standard way. And the lack of adoption of > standard encodings and input methods. Just to give one example, even though > a single encoding (called ISCII) for all Brahmi-origin scripts of India was > created officially, most people were unaware of it or didn't use it for so > many reasons. One major reason being that it was not supported on Operating > Systems, including Windows (which was anyway developed many years after > creation of ISCII). Input methods and rendering engines for it were not > available. You had to have a special terminal to use it, but that was text > only terminal, used mainly in research centers and perhaps for some very > limited official purposes. And computers, as far as the general public was > concerned, were most commonly used for DeskTop Publishing (which became > part of Indian languages as "DTP"). These non-standard encodings were > mainly font-encodings, just to enable proper rendering of text for > publishing. One of the most popular 'encodings' was based on the Remington > typewriter for Hindi. Another was based on mostly phonetic mapping from > Roman to Devanagari. Other languages which did not use Devanagari also had > their own non-standard encodings, often multiple encodings. The reason > these became popular was that they enabled people to type in Indian > languages and see the text rendered properly, since no other option was > available and they understandably didn't really care about it being > standard or not. It wasn't until recently that Indic scripts were properly > supported by any OS's. It is possible that even now, when Unicode is > supported on most OS's and input methods are available as part of OS's, > there are people still using non-standard encodings. Even now, you can come > across problems related to either input methods or rendering for Indic > scripts on OS's. And most importantly, there is still no universally > accepted way to actually use these input methods. Most Indians, for > example, on social media and even in emails type in some pseudo-phonetic > way using Roman letters with the QWERTY keyboard. Typing in Indian > languages using Indic scripts is still a specialized skill. > > The result of all this is that when you try to collect data for low > resource languages, including major languages of India, there may be a lot > of data -- or perhaps even all the data, depending on the language -- which > is in some non-standard ad-hoc encoding which has non-trivial mapping with > Unicode. This is difficult partly because non-standard encodings are often > based on glyphs, rather than actual units of the script. So, to be able to > use it you need a perfect encoding converter to get the text in Unicode > (UTF-8). Such converters have been there for a long time, but since they > were difficult to create, they were/are proprietary and not available even > to researchers in most cases. It seems a pretty good OCR system has been > developed for Indic scripts/languages, but I have not yet had the chance to > try it. > > For example, I am currently (for the last few years) working on Bhojpuri, > Magahi and Maithili. When we tried to collect data for these languages, > there was the same problem, which is actually not really a problem for the > general public because their purpose is served by these non-standard > encodings, but for NLP/CL you face difficulty in getting the data in a > usable form. > > This is just a brief overview and I also don't really know the full extent > of it, in the sense that I don't have a comprehensive list of such > non-standard encodings for all Indic scripts. > > >> Re the unpublished paper (on a computational typology of writing >> systems?): >> when and to where (as in, which venues/publications) did you submit it? >> I remember one of my first term papers from the 90s being on the >> phonological system of written Cantonese (or sth like that --- don't >> remember my wild days), the prof told me it wasn't "exactly linguistics"... >> >> I had submitted to the journal Written Language and Literacy in 2009. It > was actually mostly my mistake that I didn't submit a revised version of > the paper as I was going through a difficult period then. > > >> Re "on building an encoding converter that will work for all 'encodings' >> used for Indian languages": >> this sounds interesting! >> >> > Yes, I still sometimes wish I could build it. > > >> Re "I too wish there was a good comprehensive history text encodings, >> including non-standard ad-hoc encodings": >> what do you mean by that --- history of text encodings or historical text >> encodings? >> After my discoveries from recent years, when my "mental model" towards >> what's been practiced in the language space (esp. in CL/NLP) finally >> *completely *shifted, I had wanted to host (or co-host) a tutorial on >> character encoding for those who might be under-informed on the matter >> (including but not limited to the "grammaroholics" (esp. the CL/NLP >> practitioners who seem to be stuck doing grammar, even in the context of >> computing) --- there are so many of them! :) ) >> >> > I mostly meant the non-standard 'encodings' (really just ad-hoc mappings) > to serve someone's current purpose. To fully understand the situation, you > have to be familiar with social-political-economic-etc. aspects of the > language situation in India. > > >> Re "word level language identification": >> I don't do "words" anymore. In that 2016 TBLID paper of mine, I >> (regrettably) was still going with the flow in under-reporting on >> tokenization procedures (like what many "cool" ML papers did). But "words" >> do certainly shape the results! I'm really forward to everyone working with >> full-vocabulary, pure character or byte formats (depending on the task), >> while being 100% aware of statistics. Things can be much more transparent >> and easily replicable/reproducible that way anyway. >> >> > Well, I used the word 'word' as just a shorthand for space separated > segments. In my PhD thesis, I had also argued against word being the unit > of computational processing or whatever you call it. I had called the unit > Extra-Lexical Unit, consisting of a core morpheme and inflectional parts. I > realize now that even that may not necessarily work for languages with > highly fusional morphology. But, something like this is now the preferred > unit of morphological processing, as in the CoNLL shared tasks and > UniMorph. I also realize that I could not have been the first to come to > this conclusion. > > >> Re "We have to be tolerant of what you call bad research for various >> unavoidable reasons. Research is not what it used to be": >> No, I think one should just call out bad research and stop doing it. I >> wouldn't want students to burn their midnight oil working hard for nothing. >> Bad research warps also expectations and standards, in other sectors as >> well (education, healthcare, commerce... etc.). Science, as in the pursuit >> of truth and clarity, is and should be the number 1 priority of any decent >> research. (In my opinion, market research or research for marketing >> purposes should be all consolidated into one track/venue if they lack >> scientific quality.) I agree research is not what it used to be --- but in >> the sense that the quality is much worse in general, much hacking around >> with minor, incremental improvements. Like in the case of "textual >> computing", people are "grammar"-hacking. >> >> > I completely agree with you, but ... Sometimes it is wise to be silent. > > >> Re *better ... gender representation": >> hhmm... I'm not so sure about that. >> >> > You are a better judge of that. I just shared my opinion, which may not be > completely free from bias, although I do try. > > >> Re "About grammar, I have come to think of it as a kind of language model >> for describing some linguistic phenomenon": >> nah, grammar not necessary. >> >> > I don't say it is necessary, but I see it as one possible model to > describe language, which can be useful for some -- like educational -- > purposes if used in the right way. > > >> Re grammaroholic reviewers: >> yeah, there are tons of those in the CL/NLP space. I think many of them >> are only willing and/or able to critique on grammar. Explicit is that it >> shows that they don't want to check one's math and code --- besides, when >> most work on "words" anyway, there is a limit to how things are >> replicable/reproducible, esp. if on a different dataset. The implicit bit, >> however, is that I think there is some latent intent to introduce/reinforce >> the influence of "grammar" into the computing space. That, I do not agree >> with at all. >> >> > I should confess that I sometimes am guilty of that (pointing out > grammatical mistakes) myself. However, the situation is complicated in > countries like India due to historical and other reasons. I think the > papers should at least be in a condition that they can be understood > roughly as intended. This may not always be the case, particularly with > non-native speakers of English, or people who are not yet speakers/writers > of English at all. Now, perhaps no one knows better than myself that it is > not really their fault completely, but as a reviewer, sometimes your > patience is severely tested. > > >> Re "magic": >> yes, once one gets over the hype, it's just work. >> >> > True, but what I said is based on where I am coming from (as in, to this > position), which will take a really long time to explain. Of course, I > don't literally mean magic. > > Re "I have no experience of field work at all and that I regret, but it is >> partly because I am not a social creature": >> one can be doing implicit and unofficial "fieldwork" everyday if one pays >> attention to how language is used. >> >> > That indeed I do all the time. I meant official fieldwork. >

1 1

Call For Participation: Machine Translation for Indian Languages (MTIL)
by Surupendu Gangopadhyay 08 Aug '23

08 Aug '23

Apologies for multiple posting *********************************** *------------------------------------------------------------------------------------------------Machine Translation for Indian Languages (MTIL) 2023------------------------------------------------------------------------------------------------* We invite all IR and NLP researchers and enthusiasts to participate in the MTIL track (https://mtilfire.github.io/mtil/2023/) held in conjunction with the Forum for Information Retrieval Evaluation (FIRE) 2023 ( http://fire.irsi.res.in/). Indian languages have many linguistic complexities. Though some Indian languages share syntactic similarities, some possess intricate morphological structures. At the same time, some Indian languages are low-resource. Therefore the machine translation models should address these unique challenges in translating between Indian languages. The MTIL track consists of two tasks: 1. *General Translation Task (Task 1):* Task participants should build a machine translation model to translate sentences of the following language pairs: 1. Hindi-Gujarati 2. Hindi-Kannada 3. Kannada-Hindi 4. Hindi-Odia 5. Odia-Hindi 6. Hindi-Punjabi 7. Punjabi-Hindi 8. Hindi-Sindhi 9. Urdu-Kashmiri 10. Telugu-Hindi 11. Hindi-Telugu 12. Urdu-Hindi 13. Hindi-Urdu 2. *Domain Specific Translation Task (Task 2)*: Task participants will build machine translation models for Governance and Healthcare domains. 1. Healthcare: a. Hindi-Gujarati b. Kannada-Hindi c. Hindi-Odia d. Odia-Hindi e. Hindi-Punjabi f. Kannada-Hindi 2. Governance: a. Hindi-Gujarati b. Kannada-Hindi c. Hindi-Odia d. Odia-Hindi e. Hindi-Punjabi f. Kannada-Hindi *Dataset:* The primary source of parallel language pairs is Bharat Parallel Corpus Collection (BPCC), released by AI4Bharat (https://ai4bharat.iitm.ac.in/bpcc ). Participants are encouraged to add datasets of their choice, including parallel corpora and monolingual datasets, to train their models. More information on registration and participation in the track can be found here: https://mtilfire.github.io/mtil/2023/ This track is being done in association with BHASHINI ( https://bhashini.gov.in/) *Organisers* - Prasenjit Majumder, DAIICT Gandhinagar,India and TCG CREST, Kolkata,India - Arafat Ahsan, IIIT-Hyderabad,India - Asif Ekbal, IIT-Patna,India - Saran Pandian, DAIICT Gandhinagar,India - Ramakrishna Appicharla, IIT-Patna ,India - Surupendu Gangopadhyay, DAIICT Gandhinagar,India - Ganesh Epili, DAIICT Gandhinagar,India - Dreamy Pujara, DAIICT Gandhinagar,India - Misha Patel, DAIICT Gandhinagar,India - Aayushi Patel, DAIICT Gandhinagar,India - Bhargav Dave, DAIICT Gandhinagar,India - Mukesh Jha, DAIICT Gandhinagar,India

1 0

CfP: LREC-COLING 2024 (Torino)
by Enrico Santus 07 Aug '23

07 Aug '23

Call for Papers [https://lrec-coling-2024.org/2nd-call-for-papers/] Two international key players in the area of computational linguistics, the ELRA Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL), are joining forces to organize the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) to be held in Torino, Italy on 20-25 May, 2024. IMPORTANT DATES (All deadlines are 11:59PM UTC-12:00 (“anywhere on Earth”) - 22 September 2023: Paper anonymity period starts - 13 October 2023: Final submissions due (long, short and position papers) - 13 October 2023: Workshop/Tutorial proposal submissions due - 22–29 January 2024: Author rebuttal period - 5 February 2024: Final reviewing - 19 February 2024: Notification of acceptance - 25 March 2024: Camera-ready due - 20-25 May 2024: LREC-COLING2024 conference SUBMISSION TOPICS LREC-COLING 2024 invites the submission of long and short papers featuring substantial, original, and unpublished research in all aspects of natural language and computation, language resources (LRs) and evaluation, including spoken and sign language and multimodal interaction. Submissions are invited in five broad categories: (i) theories, algorithms, and models, (ii) NLP applications, (iii) language resources, (iv) NLP evaluation and (v) topics of general interest. Submissions that span multiple categories are particularly welcome. (I) Theories, algorithms, and models - Discourse and Pragmatics - Explainability and Interpretability of Large Language Models - Language Modeling - CL/NLP and Linguistic Theories - CL/NLP for Cognitive Modeling and Psycholinguistics - Machine Learning for CL/NLP - Morphology and Word Segmentation - Semantics - Tagging, Chunking, Syntax and Parsing - Textual Inference (II) NLP applications - Applications (including BioNLP and eHealth, NLP for legal purposes, NLP for Social Media and Journalism, etc.) - Dialogue and Interactive Systems - Document Classification, Topic Modeling, Information Retrieval and Cross-Lingual Retrieval - Information Extraction, Text Mining, and Knowledge Graph Derivation from Texts - Machine Translation for Spoken/Written/Sign Languages, and Translation Aids - Sentiment Analysis, Opinion and Argument Mining - Speech Recognition/Synthesis and Spoken Language Understanding - Natural Language Generation, Summarization and Simplification - Question Answering - Offensive Speech Detection and Analysis - Vision, Robotics, Multimodal and Grounded Language Acquisition (III) Language resource design, creation, and use: text, speech, sign, gesture, image, in single or multimodal/multimedia data - Guidelines, standards, best practices and models for LRs, interoperability - Methodologies and tools for LRs construction, annotation, and acquisition - Ontologies, terminology and knowledge representation - LRs and Semantic Web (including Linked Data, Knowledge Graphs, etc.) - LRs and Crowdsourcing - Metadata for LRs and semantic/content mark-up - LRs in systems and applications such as information extraction, information retrieval, audio-visual and multimedia search, speech dictation, meeting transcription, Computer-Aided Language Learning, training and education, mobile communication, machine translation, speech translation, summarisation, semantic search, text mining, inferencing, reasoning, sentiment analysis/opinion mining, (speech-based) dialogue systems, natural language and multimodal/multisensory interactions, chatbots, voice-activated services, etc. - Use of (multilingual) LRs in various fields of application like e-government, e-participation, e-culture, e-health, mobile applications, digital humanities, social sciences, etc. - LRs in the age of deep neural networks - Open, linked and shared data and tools, open and collaborative architectures - Bias in language resources - User needs, LT for accessibility (IV) NLP evaluation methodologies - NLP evaluation methodologies, protocols and measures - Benchmarking of systems and products - Evaluation metrics in Machine Learning - Usability evaluation of HLT-based user interfaces and dialogue systems - User satisfaction evaluation (V) Topics of general interest - Multilingual issues, language coverage and diversity, less-resourced languages - Replicability and reproducibility issues - Organisational, economical, ethical and legal issues - Priorities, perspectives, strategies in national and international policies - International and national activities, projects and initiatives PAPER THEME TRACKS Those topics are organized into 26 main tracks: - LC01 Applications Involving LRs and Evaluation (including Applications in Specific Domains) - LC02 CL and Linguistic Theories, Cognitive Modeling and Psycholinguistics - LC03 Corpora and Annotation (including Tools, Systems, Treebanks) - LC04 Dialogue, Conversational Systems, Chatbots, Human-Robot Interaction - LC05 Digital Humanities and Cultural Heritage - LC06 Discourse and Pragmatics - LC07 Document Classification, Information Retrieval and Cross-lingual Retrieval - LC08 Evaluation and Validation Methodologies - LC09 Inference, Reasoning, Question Answering - LC10 Information Extraction, Knowledge Extraction, and Text Mining - LC11 Integrated Systems and Applications - LC12 Knowledge Discovery/Representation (including Knowledge Graphs, Linked Data, Terminology, Ontologies) - LC13 Language Modeling - LC14 Less-Resourced/Endangered/Less-studied Languages - LC15 Lexicon and Semantics - LC16 Machine Learning Models and Techniques for CL/NLP - LC17 Multilinguality, Machine Translation, and Translation Aids (including Speech-to-Speech Translation) - LC18 Multimodality, Cross-modality (including Sign Languages, Vision and Other Modalities), Multimodal Applications, Grounded Language Acquisition, and HRI - LC19 Natural Language Generation, Summarization and Simplification - LC20 Offensive and Non-inclusive Language Detection and Analysis - LC21 Opinion & Argument Mining, Sentiment Analysis, Emotion Recognition/Generation - LC22 Parsing, Tagging, Chunking, Grammar, Syntax, Morphosyntax, Morphology - LC23 Policy issues, Ethics, Legal Issues, Bias Analysis (including Language Resource Infrastructures, Standards for LRs, Metadata) - LC24 Social Media Processing - LC25 Speech Resources and Processing (including Phonetic Databases, Phonology, Prosody, Speech Recognition, Synthesis and Spoken Language Understanding) - LC26 Trustworthiness, Interpretability, and Explainability of Neural Models PAPER TYPES AND FORMATS LREC-COLING 2024 invites high-quality submissions written in English. Submissions of three forms of papers will be considered: 1. Regular long papers – up to eight (8) pages maximum*, presenting substantial, original, completed, and unpublished work. 2. Short papers – up to four (4) pages*, describing a small focused contribution, negative results, system demonstrations, etc. 3. Position papers – up to eight (8) pages*, discussing key hot topics, challenges and open issues, as well as cross-fertilization between computational linguistics and other disciplines. * Excluding any number of additional pages for references, ethical consideration, conflict-of-interest, as well as data, and code availability statements. Upon acceptance, final versions of long papers will be given one additional page – up to nine (9) pages of content plus unlimited pages for acknowledgments and references – so that reviewers’ comments can be taken into account. Final versions of short papers may have up to five (5) pages, plus unlimited pages for acknowledgments and references. For both long and short papers, all figures and tables that are part of the main text must fit within these page limits. Furthermore, appendices or supplementary material will also be allowed ONLY in the final, camera-ready version, but not during submission, as papers should be reviewed without the need to refer to any supplementary materials. Linguistic examples, if any, should be presented in the original language but also glossed into English to allow accessibility for a broader audience. Note that paper types are decisions made orthogonal to the eventual, final form of presentation (i.e., oral versus poster). PAPER SUBMISSIONS AND TEMPLATES Submission is electronic, using the Softconf START conference management system via the link: https://softconf.com/lrec-coling2024/papers/ Both long and short papers must follow the LREC-COLING 2024 two-column format, using the supplied official style files. The templates can be downloaded from the Style Files and Formatting page provided on the website. Please do not modify these style files, nor should you use templates designed for other conferences. Submissions that do not conform to the required styles, including paper size, margin width, and font size restrictions, will be rejected without review. AUTHOR RESPONSIBILITIES Papers must be of original, previously-unpublished work. Papers must be anonymized to support double-blind reviewing. Submissions thus must not include authors’ names and affiliations. The submissions should also avoid links to non-anonymized repositories: the code should be either submitted as supplementary material in the final version of the paper, or as a link to an anonymized repository (e.g., Anonymous GitHub <https://anonymous.4open.science/> or Anonym Share <https://anonymfile.com/>). Papers that do not conform to these requirements will be rejected without review. If the paper is available as a preprint, this must be indicated on the submission form but not in the paper itself. In addition, LREC-COLING 2024 will follow the same policy as ACL conferences establishing an anonymity period during which non-anonymous posting of preprints is not allowed. More specifically, direct submissions to LREC-COLING 2024 may not be made available online (e.g. via a preprint server) in a non-anonymized form after September 22, 11:59PM UTC-12:00 (for arXiv, note that this refers to submission time). Also included in that policy are instructions to reviewers to not rate papers down for not citing recent preprints. Authors are asked to cite published versions of papers instead of preprint versions when possible. Papers that have been or will be under consideration for other venues at the same time must be declared at submission time. If a paper is accepted for publication at LREC-COLING 2024, it must be immediately withdrawn from other venues. If a paper under review at LREC-COLING 2024 is accepted elsewhere and authors intend to proceed there, the LREC-COLING 2024 committee must be notified immediately ETHICS STATEMENT We encourage all authors submitting to LREC-COLING 2024 to include an explicit ethics statement on the broader impact of their work, or other ethical considerations after the conclusion but before the references. The ethics statement will not count toward the page limit (8 pages for long, 4 pages for short papers). PRESENTATION REQUIREMENT All papers accepted to the main conference track must be presented at the conference to appear in the proceedings, and at least one author must register for LREC-COLING2024. Papers will be presented either orally or as posters. The specific presentation modality of a paper will be decided based on its content, with no difference in quality implied. Papers that include a demonstration component will be presented as posters. All papers accepted to the main conference will be required to submit a presentation video. The conference will be hybrid, with an emphasis on encouraging interaction between the online and in-person modalities, and thus presentations can be either on-site or virtual. -- Enrico Santus, PhD *Head of Human Computation* *CTO Office at Bloomberg LP* *Website*: www.esantus.com *E-mail*: esantus(a)gmail.com *All opinions expressed in my private e-mails are my own, and they do not represent any group, institution or company to which I am associated.*

1 0

PhD position on efficient E2E models for Automatic Speech Recognition
by Daniele Falavigna 07 Aug '23

07 Aug '23

--- Apologies for multiple posting --- The Speech Technology group (SpeechTek) at Fondazione Bruno Kessler <https://www.fbk.eu/en/> (Trento, Italy) in conjunction with the ICT International Doctorate School of the University of Trento <https://iecs.unitn.it/> is pleased to announce the availability of the following fully-funded PhD position: *TITLE*: Efficient E2E models for automatic speech recognition in multi-speaker scenarios *DESCRIPTION*: In spite of the recent progress in speech technologies, processing and understanding conversational spontaneous speech is still an open issue, in particular in presence of challenging acoustic conditions as those posed by dinner party scenarios. Although enormous progresses have been made recently in a variety of speech processing tasks (such as speech enhancement, speech separation, speech recognition, spoken language understanding), targeting also multi-speaker speech recognition, a unified established solution is still far from being available. Moreover, the computational complexity of current approaches is extremely high, making an actual deployment in low-end or IoT devices not feasible in practice. The candidate will advance the current state-of-the-art in speech processing (in particular for separation, enhancement and recognition) towards developing a unified solution, possibly based on self-supervised or unsupervised approaches, for automatic speech recognition in dinner party scenarios, as those considered in the CHiME challenges ( https://arxiv.org/abs/2306.13734). *CONTACTS*: brutti(a)fbk.eu <guerini(a)fbk.eu> *COMPLETE DETAILS AVAILABLE AT*: https://iecs.unitn.it/education/admission/reserved-topic-scholarships#A9 ------------------------------------------------------------------------------------------------ Giuseppe Daniele Falavigna Fondazione Bruno Kessler Via Sommarive 18 - 38123 Povo - Trento, Italy mail:falavi@fbk.eu - tel:+39(0)461314562 - fax:+39(0)461314591 HomePage: https://speechtek.fbk.eu/people/profile/falavi ------------------------------------------------------------------------------------------------- -- -- Le informazioni contenute nella presente comunicazione sono di natura privata e come tali sono da considerarsi riservate ed indirizzate esclusivamente ai destinatari indicati e per le finalità strettamente legate al relativo contenuto. Se avete ricevuto questo messaggio per errore, vi preghiamo di eliminarlo e di inviare una comunicazione all’indirizzo e-mail del mittente. -- The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. If you received this in error, please contact the sender and delete the material.

1 0

Shared Task: Prompting Large Language Models as Explainable Metrics
by Juri Opitz 07 Aug '23

07 Aug '23

Dear colleagues, you are invited to participate in the Eval4NLP 2023 shared task on **Prompting Large Language Models as Explainable Metrics**. Please find more information below and on the shared task webpage: https://eval4nlp.github.io/2023/shared-task.html Important Dates - Shared task announcement: August 02, 2023 - Dev phase: August 07, 2023 - Test phase: September 18, 2023 - System Submission Deadline: September 23, 2023 - System paper submission deadline: October 5, 2023 - System paper camera ready submission deadline: October 12, 2023 All deadlines are 11.59 pm UTC -12h (“Anywhere on Earth”). The timeframe of the test phase may change. Please regularly check the shared task webpage: https://eval4nlp.github.io/2023/shared-task.html. ** Overview ** With groundbreaking innovations in unsupervised learning and scalable architectures the opportunities (but also risks) of automatically generating audio, images, video and text, seem overwhelming. Human evaluations of this content are costly and are often infeasible to collect. Thus, the need for automatic metrics that reliably judge the quality of generation systems and their outputs, is stronger than ever. Current state-of-the-art metrics for natural language generation (NLG) still do not match the performance of human experts. They are mostly based on black-box language models and usually return a single quality score (sentence-level), making it difficult to explain their internal decision process and their outputs. The release of APIs to large language models (LLMs), like ChatGPT and the recent open-source availability of LLMs like LLaMA has led to a boost of research in NLP, including LLM-based metrics. Metrics like GEMBA [*] explore the prompting of ChatGPT and GPT4 to directly leverage them as metrics. Instructscore [*] goes in a different direction and finetunes a LLaMA model to predict a fine grained error diagnosis of machine translated content. We notice that current work (1) does not systematically evaluate the vast amount of possible prompts and prompting techniques for metric usage, including, for example, approaches that explain a task to a model or let the model explain a task itself, and (2) rarely evaluates the performance of recent open-source LLMs, while their usage is incredibly important to improve the reproducibility of metric research, compared to closed-source metrics. This year’s Eval4NLP shared task, combines these two aspects. We provide a selection of open-source, pre-trained LLMs. The task is to develop strategies to extract scores from these LLM’s that grade machine translations and summaries. We will specifically focus on prompting techniques, therefore, fine-tuning of the LLM’s is not allowed. Based on the submissions, we hope to explore and formalize prompting approaches for open-source LLM-based metrics and, with that, help to improve their correlation to human judgements. As many prompting techniques produce explanations as a side product we hope that this task will also lead to more explainable metrics. Also, we want to evaluate which of the selected open-source models provide the best capabilities as metrics, thus, as a base for fine-tuning. ** Goals ** The shared task has the following goals: Prompting strategies for LLM-based metrics: We want to explore which prompting strategies perform best for LLM-based metrics. E.g., few-shot prompting [*], where examples of other solutions are given in a prompt, chain-of-thought reasoning (CoT) [*], where the model is prompted to provide a multi-step explanation itself, or tree-of-thought prompting [*], where different explanation paths are considered, and the best is chosen. Also, automatic prompt generation might be considered [*]. Numerous other recent works explore further prompting strategies, some of which use multiple evaluation passes. Score aggregation for LLM-based metrics: We also want to explore which strategies best aggregate the model scores from LLM-based metrics. E.g., scores might be extracted as the probability of a paraphrase being created [*], or they could be extracted from LLM output directly [*]. Explainability for LLM-based metrics: We want to analyze whether the metrics that provide the best explanations (for example with CoT) will achieve the highest correlation to human judgements. We assume that this is the case, due to the human judgements being based on fine-grained evaluations themselves (e.g. MQM for machine translation) ** Task Description ** The task will consist of building a reference-free metric for machine translation and/or summarization that predicts sentence-level quality scores constructed from fine-grained scores or error labels. Reference-free means that the metric rates the provided machine translation solely based on the provided source sentence/paragraph, without any additional, human written references. Further, we note that many open-source LLMs have mostly been trained on English data, adding further challenges to the reference-free setup. To summarize, the task will be structured as follows: - We provide a list of allowed LLMs from Huggingface - Participants should use prompting to use these LLMs as metrics for MT and summarization - Fine-tuning of the selected model(s) is not allowed - We will release baselines, which participants might build upon - We will provide a CodaLab dashboard to compare participants' solutions to others We plan to release a CodaLab submission environment together with baselines and dev set evaluation code successively until August 7. We will allow specific models from Huggingface, please refer to the webpage for more details: https://eval4nlp.github.io/2023/shared-task.html Best wishes, The Eval4NLP organizers [*] References are listed on the shared task webpage: https://eval4nlp.github.io/2023/shared-task.html

1 0

Third call for papers DHASA Conference 2023 Extended deadlines
by Menno Van Zaanen 07 Aug '23

07 Aug '23

Third call for papers DHASA Conference 2023 Extended deadlines https://dh2023.digitalhumanities.org.za/ Note extended deadlines Theme: "Digital Humanities for Inclusion" The Digital Humanities Association of Southern Africa (DHASA) is pleased to announce its fourth conference, focusing on the theme "Digital Humanities for Inclusion." In a region where the field of Digital Humanities is still relatively underdeveloped, this conference aims to address this gap and foster growth and collaboration in the field. The conference offers an opportunity for researchers interested in showcasing their work in the broad field of Digital Humanities to come together. By doing so, the conference provides a comprehensive overview of the current state-of-the-art in Digital Humanities, particularly within the Southern Africa region. As such, we welcome submissions related to Digital Humanities research conducted by individuals from Southern Africa or research focused on the geographical area of Southern Africa. Furthermore, the conference serves as a platform for information sharing and networking among researchers passionate about Digital Humanities. By bringing together experts working on Digital Humanities in Southern Africa or with a focus on Southern Africa, we aim to promote collaboration and facilitate further research in this dynamic field. In addition to the main conference, affiliated workshops and tutorials will be organized, providing researchers with valuable insights into novel technologies and tools. These supplementary events are designed for researchers interested in specific aspects of Digital Humanities or seeking practical information to enter or advance their knowledge in the field. The DHASA conference welcomes interdisciplinary contributions from researchers in various domains of Digital Humanities, including, but not limited to, language, literature, visual art, performance and theatre studies, media studies, music, history, sociology, psychology, language technologies, library studies, philosophy, methodologies, software and computation, and more. Our goal is to cultivate an inclusive scientific community of practice within Digital Humanities. Suggested topics include the following: * Digital archives and the preservation of marginalized voices; * Intersectionality and the digital humanities: exploring the intersections of race, gender, sexuality, and class in digital research and activism; * Activism and social change through digital media: how digital humanities tools and methodologies can be used to promote inclusion; * Engaging marginalized communities in the creation and use of digital tools and resources; * Exploring the role of digital humanities in decolonizing knowledge and promoting indigenous perspectives; * The ethics of data collection and analysis in digital humanities research related; * The role of digital humanities in promoting inclusive and equitable pedagogy; * Digital humanities and inclusion in the context of global perspectives and international collaborations; * Critical approaches to digital humanities and inclusion: examining the limitations and possibilities of digital tools and methodologies in promoting inclusion; and * Collaborative digital humanities projects with non-profit organizations, community groups, and cultural institutions; * Any other digital humanities-related topic that serves the Southern African community. Submission Guidelines The DHASA conference 2023 asks for three types of submissions: * Long papers: Authors may submit long papers consisting of a maximum of 8 content pages and unlimited pages for references and appendix. The final versions of accepted long papers will be granted an additional page (up to 9 pages) to incorporate reviewers' comments. * Short papers: Authors may submit short papers with a maximum of 5 content pages and unlimited pages for references and appendix. The final versions of accepted short papers will be allowed an extra page (up to 6 pages) to accommodate reviewers' comments. Short papers accepted for the conference will be presented as posters. * Abstracts: Authors can submit abstracts of 250-300 words. Note that before submitting your contribution, you are required to submit an abstract before the abstract submission deadline. This holds for *all* submissions. The actual submission will need to be submitted before the submission deadline. More information on the submission process can be found on the submission page: https://dh2023.digitalhumanities.org.za/submission/ We particularly encourage student submissions where the first author is a student. All accepted long and short paper submissions that are presented at the conference will be published in the Journal of Digital Humanities Association of Southern Africa, see https://upjournals.up.ac.za/index.php/dhasa. In addition, the abstracts of the full papers and the lightning talks will be published in a book of abstracts before the conference. Important dates Abstract submission deadline: *22 August 2023* Submission deadline: *29 August 2023* Date of notification: 30 September 2023 Camera-ready copy deadline: 6 November 2023 Conference: 27 November 2023 - 1 December 2023 Conference format: Face-to-face Conference venue: Nelson Mandela University, Eastern Cape South Africa NOTE: Non-presenting delegates have the option to attend online. Co-located events Several co-located events are currently being prepared, including workshops and tutorials. These will be updated on the conference website. Organizing Committee * Johannes Sibeko, Nelson Mandela University * Aby Louw, Council for Scientific and Industrial Research * Alan Murdoch, Nelson Mandela University * Amanda du Preez, University of Pretoria * Andiswa Bukula, South African Centre for Digital Language Resources * Andiswa Mvanyashe, Nelson Mandela University * Avashna Govender, Council for Scientific and Industrial Research * Gabby Dlamini, Nelson Mandela University * Ilana Wilken, Council for Scientific and Industrial Research * Jonathan van der Walt, Nelson Mandela University * Laurette Marais, Council for Scientific and Industrial Research * Mukhtar Raban, Nelson Mandela University * Nomfundo Khumalo, Nelson Mandela University * Menno Van Zaanen, South African Centre for Digital Language Resources -- Prof Menno van Zaanen menno.vanzaanen(a)nwu.ac.za Professor in Digital Humanities South African Centre for Digital Language Resources https://www.sadilar.org ________________________________ NWU PRIVACY STATEMENT: http://www.nwu.ac.za/it/gov-man/disclaimer.html DISCLAIMER: This e-mail message and attachments thereto are intended solely for the recipient(s) and may contain confidential and privileged information. Any unauthorised review, use, disclosure, or distribution is prohibited. If you have received the e-mail by mistake, please contact the sender or reply e-mail and delete the e-mail and its attachments (where appropriate) from your system. ________________________________

1 0

Call for participation: SIGDIAL-INLG 2023, Sep 11-15, Prague Czechia
by Ondrej Dusek 07 Aug '23

07 Aug '23

*** Call for Participation *** SIGDIAL & INLG 2023 Conferences September 11-15, 2023 Prague, Czechia & online https://sigdialinlg2023.github.io/ Early Registration Deadline: **August 10** Late Registration Deadline: September 15 Non-presenters Free Registration: August 12 - September 15 Workshops: September 11-12 Main Conferences: September 13-15 The 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGdial 2023) and the 16th International Natural Language Generation Conference (INLG 2023) will be held jointly this year in Prague, Czechia. The event will be hybrid, but in-person participation is strongly encouraged! Virtual attendance will be free for non-presenters. The organizers of SIGDIAL & INLG 2023 invite all researchers and practitioners, SIGDial & SIGGEN members, and SIGDIAL & INLG 2023 industry partners and sponsors to join the conference. The registration is now open, with early rates available until August 10 (see https://sigdialinlg2023.github.io/registration.html). A limited number of hotel rooms at the conference venue is available for booking at special rates until August 10. **SIGDIAL** provides a regular forum for the presentation of cutting-edge research in discourse and dialogue to both academic and industry researchers. Continuing a series of 23 successful previous meetings, this conference spans the research interest areas of discourse and dialogue. The conference is sponsored by the SIGdial organization, which serves as the Special Interest Group on discourse and dialogue for both ACL and ISCA. **INLG** is a yearly venue for presentations related to all aspects of Natural Language Generation (NLG), including data-to-text, concept-to-text, text-to-text and vision-to-text approaches. The event is organized under the auspices of SIGGEN, the Special Interest Group on Natural Language Generation of ACL. The joint conference on Sep 13-15 will feature 4 keynote speeches by Barbara Di Eugenio, Emmanuel Dupoux, Elena Simperl and Ryan Lowe, as well as a number of regular paper presentations and system demonstrations. - Keynotes: https://sigdialinlg2023.github.io/speakers.html - SIGDIAL accepted papers: https://2023.sigdial.org/accepted-papers/ - INLG accepted papers: https://inlg2023.github.io/accepted_papers.html The event includes several workshops on Sep 11-12 (see https://sigdialinlg2023.github.io/workshops.html): - YRRSDS: 19th Young Researchers' Roundtable on Spoken Dialogue Systems - The 1st Workshop on Counter Speech for Online Abuse - DSTC11: The 11th Dialog System Technology Challenge - PracticalD2T: 1st Workshop on Practical LLM-assisted Data-to-Text Generation - Taming Large Language Models: Controllability in the era of Interactive Assistants - Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge - Connecting multiple disciplines to AI techniques in interaction-centric autism research and diagnosis - Designing divergent agent tasks for SDS data collection We thank you for your support and look forward to welcoming you at the conference! Best regards, SIGDIAL & INLG 2023 Organizers (Apologies for cross-posting)

1 0

2025

2024

2023

2022

Corpora August 2023