We warmly invite you to the next Natural Language Processing and Vision (NLPV) seminars at the University of Exeter.
Talk 1: Thursday 21 Aug 2025, 16:00–17:00 (GMT+1)
Location: https://Universityofexeter.zoom.us/j/97587944439?pwd=h4rnPO0PafT9oRrrqQsezGZ... (Meeting ID: 975 8794 4439, Password: 064414)
Title: Trustworthy Optimization of Pre-Trained Models for Healthcare: Generalizability, Adaptability, and Security
Abstract: Pre-trained language models have opened new possibilities in healthcare, showing promise in mining scientific literature, analyzing large-scale clinical data, identifying patterns in emerging diseases, and automating workflows, positioning themselves as intelligent research assistants. However, general-purpose models, typically trained on web-scale corpora, often lack the clinical grounding necessary for reliable deployment in high-stakes domains like healthcare. To be effective, they must be adapted to meet domain-specific requirements. My PhD thesis addresses three core challenges in leveraging pre-trained models for healthcare: (i) the scarcity of labeled data for fine-tuning, (ii) the evolving nature of healthcare data, and (iii) the need to ensure transparency and traceability of AI-generated content. In this talk, I will focus on the third challenge: enabling traceability of content generated by large language models. I will begin with an overview of prior watermarking approaches and then present our proposed solution. We introduce a watermarking algorithm applied at inference time that perturbs the model’s logits to bias generation toward a subset of vocabulary tokens determined by a secret key. To ensure that watermarking does not compromise generation quality, we propose a multi-objective optimization (MOO) framework that employs lightweight networks to produce token-specific watermarking logits and splitting ratios, specifying how many tokens to bias and by how much. This approach effectively balances watermark detectability with semantic coherence. Experimental results show that our method significantly improves detectability and robustness against removal attacks while preserving the semantics of the generated text, outperforming existing watermarking techniques.
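To make the core mechanism concrete, below is a minimal Python sketch of inference-time watermarking via logit biasing, in the spirit of green-list schemes. It is illustrative only: the function name, the fixed values of gamma (the splitting ratio) and delta (the bias strength), and the SHA-256 seeding are assumptions for exposition; the talk's method instead predicts token-specific watermarking logits and splitting ratios with lightweight networks trained under a multi-objective optimization framework.

```python
import hashlib

import torch


def watermark_logits(logits: torch.Tensor, prev_token: int, secret_key: str,
                     gamma: float = 0.5, delta: float = 2.0) -> torch.Tensor:
    """Bias a key-dependent 'green' subset of the vocabulary.

    logits: 1-D tensor of next-token logits over the vocabulary.
    gamma:  splitting ratio, i.e. the fraction of tokens to bias.
    delta:  how much bias to add to each green-list logit.
    Both are fixed constants here; the talk's approach predicts them
    per token with lightweight networks.
    """
    vocab_size = logits.shape[-1]
    # Derive a reproducible seed from the secret key and the previous
    # token, so a detector holding the key can rebuild the green list.
    digest = hashlib.sha256(f"{secret_key}:{prev_token}".encode()).digest()
    gen = torch.Generator().manual_seed(int.from_bytes(digest[:8], "big"))
    green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta  # nudge sampling toward green-list tokens
    return biased
```

At detection time, a party holding the secret key can recompute the green list at each position and test whether a suspect text contains significantly more green tokens than chance would predict (for example, with a one-sided z-test), without needing access to the model itself.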
Speaker's bio: Dr. Sai Ashish Somayajula is a Senior Applied Scientist in Generative AI at Oracle Cloud Infrastructure, where he develops large-scale foundation models for enterprise applications. He earned his PhD in Electrical and Computer Engineering from the University of California (UC), San Diego. His research focused on addressing key challenges in adapting and utilizing pre-trained models for healthcare. Specifically, his work spanned three core areas: (1) synthetic data generation using meta-learning-based feedback mechanisms, (2) continual learning for handling dynamic data streams without catastrophic forgetting, and (3) token-level watermarking techniques to ensure content provenance and security. His research has been published in premier venues, including the International Conference on Machine Learning (ICML), Annual Meeting of the Association for Computational Linguistics (ACL), Transactions of the Association for Computational Linguistics (TACL), Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Scientific Reports (Nature Portfolio), and Transactions of Machine Learning Research (TMLR). He is a recipient of the Jacobs School of Engineering Departmental Fellowship at UC San Diego. Ashish has collaborated with leading industrial research labs through internships at Apple and Tencent AI Lab. He holds a Bachelor's degree in Electrical Engineering with a minor in Computer Science from the Indian Institute of Technology, Hyderabad, where he was twice awarded the Academic Excellence Award, and a Master’s in Intelligent Systems and Robotics from UC San Diego.
Talk 2: Thursday 4 Sep 2025, 13:00–14:00 (GMT+1)
Location: https://Universityofexeter.zoom.us/j/95827730937?pwd=Te1wejfgr68A5lplwLQjxwg... (Meeting ID: 958 2773 0937, Password: 879296)
Title: Towards end-to-end tokenization and adaptive memory in foundation models
Abstract: Foundation models (FMs) process information as a sequence of internal representations; however, the length of this sequence is fixed and entirely determined by tokenization. This essentially decouples representation granularity from information content, which exacerbates the deployment costs of FMs and narrows their “horizons” in long sequences. What if, instead, we could free FMs from tokenizers by modelling bytes directly, while making them faster than current tokenizer-bound FMs? I argue that a recipe to achieve this goal already exists. In particular, I helped prototype how to: 1) dynamically pool representations in internal layers, progressively learning abstractions from raw data; 2) compress the KV cache of Transformers during generation without loss of performance; 3) predict multiple bytes per time step in an efficient yet expressive way; 4) retrofit existing tokenizer-bound FMs into byte-level FMs through cross-tokenizer distillation. By blending these ingredients, we may soon witness the emergence of efficient byte-level FMs.
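As a flavour of ingredient 1), here is a minimal Python sketch of boundary-based pooling over byte-level representations. The hard thresholding rule, the mean pooling, and all names are simplifying assumptions for illustration; in the actual line of work the segment boundaries are learned end-to-end inside the network rather than applied post hoc.

```python
import torch


def dynamic_pool(hidden: torch.Tensor, boundary_logits: torch.Tensor,
                 threshold: float = 0.0) -> torch.Tensor:
    """Mean-pool contiguous spans of byte-level hidden states.

    hidden:          (seq_len, d_model) representations from a lower layer.
    boundary_logits: (seq_len,) scores; a position scoring above
                     `threshold` closes the current segment.
    Returns one pooled vector per segment, so the sequence passed to
    upper layers shrinks with information content rather than being
    fixed by a tokenizer.
    """
    is_boundary = boundary_logits > threshold
    segments, current = [], []
    for t in range(hidden.shape[0]):
        current.append(hidden[t])
        if bool(is_boundary[t]) or t == hidden.shape[0] - 1:
            segments.append(torch.stack(current).mean(dim=0))
            current = []
    return torch.stack(segments)  # (num_segments, d_model)
```

Because the upper layers then operate on far fewer positions, both compute and the effective context horizon improve, which is precisely the decoupling of representation granularity from tokenization that the abstract argues for.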
Speaker's short bio (based on website): Edoardo Ponti is an assistant professor in Natural Language Processing at the University of Edinburgh and a visiting professor at NVIDIA. His research focuses on efficient architectures (see the NeurIPS 2024 tutorial on dynamic sparsity), modular deep learning (designing neural architectures that route information to specialised modules, e.g., sparse subnetworks), and computational typology (understanding how languages vary across the world and its cultures within a computational and mathematical framework). Previously, Edoardo was a visiting postdoctoral scholar at Stanford University and a postdoctoral fellow in computer science at Mila - Quebec AI Institute in Montreal. In 2021, Edoardo obtained a PhD from the University of Cambridge, St John's College. Once upon a time, Edoardo studied typological and historical linguistics at the University of Pavia. Edoardo's research has been featured in The Economist and Scientific American, among others. He received a Google Research Faculty Award and two Best Paper Awards, at EMNLP 2021 and RepL4NLP 2019. Edoardo is a board member of SIGTYP, the ACL special interest group for computational typology, a Scholar of the European Laboratory for Learning and Intelligent Systems (ELLIS), and part of the TACL journal editorial team.
Future talks will be announced on our website: https://sites.google.com/view/neurocognit-lang-viz-group/seminars
Join our *Google group* for future seminar and research updates: https://groups.google.com/g/neurocognition-language-and-vision-processing-gr...