CEA List, a research institute of Paris-Saclay University, is looking for a Postdoctoral Fellow to join its laboratory of semantic analysis of texts and images.
In the context of the DeepGenSeq project, the person hired will join an interdisciplinary team working toward predictive and generative artificial intelligence for biology. The team exploits deep contextual language models of biological sequences, whose representations generalize to several applications such as the prediction of mutational effects.
BACKGROUND
Exponential growth in sequencing throughput, together with the sampling of natural (uncultured) populations, is providing a deeper view of the diversity of protein sequences across the tree of life. Proteins are molecular engines sustaining cellular life, and the unobserved determinants of their structure and function are encoded in the distribution of observed natural sequences. These vast amounts of unlabeled sequences therefore provide evolutionary data that can form the basis for unsupervised learning of predictive and generative models of biological function.
Recent advances in machine learning, notably the development of the transformer architecture, have led to powerful language models that can be used to model protein sequences. Through transfer learning, the learned representations can be used to detect homology (i.e. the relatedness between two protein sequences), predict secondary and tertiary structures, predict residue-residue contacts, or predict fluorescence landscapes.
CHALLENGES AND OBJECTIVES
Our focus here will be to develop high-capacity transformer-based language models on protein sequence data. Intrinsic organising principles captured in the resulting representations can then be applied in transfer learning settings to different predictive sub-tasks using limited experimental data (e.g. the effect of sequence variation on protein function). Following promising recent results, we plan to also explore zero-shot inference with no additional training and/or supervision from experimental data.
Responsibilities:
* Tune and optimize existing unsupervised transformer-based language models for protein sequences.
* Develop and optimize code and machine learning algorithms for predictive models.
* Integrate and analyze large data volumes.
* Interact continuously with scientists in an interdisciplinary team.
APPLICATION
This project will be an excellent opportunity for a candidate who is looking to contribute to cutting-edge research and to train with experts in the field. We are seeking a detail-oriented computer scientist and problem solver who is passionate about science. This two-year position is open to a range of candidates, from recent college graduates to more experienced scientists (e.g. post-docs). The ideal candidate should have the following qualifications:
* Ph.D. or M.Sc. in Applied Mathematics, Computer Science, or Computational Biology.
* Experience in Deep Learning methods.
* Experience with Python, open-source software libraries for machine learning and Linux.
* Strong mathematical background and analytical skills.
* Effective organizational skills, e.g. the ability to prioritize work and contribute to the planning of a program of scientific research.
* Demonstrated interpersonal skills including both the ability to work independently and perform collaborative research in an interdisciplinary team environment.
* Good oral and written communication skills.
Preferred: previous experience with transformer-based techniques for NLP pre-training and with transformer language models.
TERMS & COMPENSATION
This two-year position is open to a range of candidates, from recent college graduates to more experienced scientists (e.g. post-docs); the chosen candidate's salary will be commensurate with their level of education, skills, and experience. Other benefits include:
- 48 days of paid holidays
- on-site subsidized restaurant
- partial remote work is possible, up to 3 days per week within the limit of 100 days per year
- CEA contribution to the personal company savings plan
LOCATION
We are based on the Paris-Saclay research campus, south of Paris, France.
HOW TO APPLY
Interested candidates should submit a resume and a short cover letter to deepgenseq «at» saxifrage.saclay.cea.fr
ABOUT US
About CEA-List: https://list.cea.fr/en/
About the LASTI lab: https://kalisteo.cea.fr/index.php/ai/ https://kalisteo.cea.fr/index.php/textual-and-visual-semantic/
About Genoscope: https://www.genoscope.cns.fr