Università degli Studi di Milano
PhD PROGRAMME IN MOLECULAR AND CELLULAR BIOLOGY
38th CYCLE (2022)
Large-scale druggability assessment of target libraries with artificial intelligence-based structure-sequence combined features
(Co-funded PhD proposal, DM352)
Location: Dipartimento di Bioscienze, Università degli Studi di Milano
The PhD project will focus on developing and engineering deep learning methods for characterising the mechanistic basis of drug-target interaction. The PhD student’s training will therefore cover structural bioinformatics data analysis, biophysics-based molecular simulation, and bioinformatics tools for aligning time series and sequences, together with the use of these alignments as input features for AI algorithms applied to proteomics. Particular attention will be paid to processing large-scale data on HPC architectures and to recent neural architectures based on latent space (variational autoencoders, recurrent neural networks, transformers), including their application to feature engineering on proteomic data.
Approach and methodology
The proposed PhD aims to address data mining, computational optimisation, and machine learning problems arising in support of new drug discovery with a modelling/computational focus, with particular emphasis on data-driven approaches. The proposed research activity is therefore based on an interdisciplinary approach at the interface between the fields of quantitative biology, AI-computational biophysics, and pharmacology. The activity is motivated by the convergence between (a) the increasing availability of proteomic data, both in the form of annotated genomes and in the form of metagenomes; (b) artificial intelligence techniques capable of analysing heterogeneous data, such as alignments between sequences, time series, and three-dimensional structures; (c) the availability of HPC computing power and corresponding architectures (including multi-GPU) for deep learning.
In this thesis, for example, data-driven feature engineering approaches will be applied to open problems in drug discovery, such as pocket identification (1), peptidomimetic development, fold stabilisation (2), AI-based generation of therapeutic hypotheses (3) and target classification (4). The growing maturity of the field and the exponential increase in the size of proteomic datasets require joint data science-biological approaches to the problems posed by computational structural biology. Introducing a strong component of data mining, deep neural architectures and data-driven optimisation in this field would make it possible to develop solutions that integrate physicochemical and statistical models, as shown for instance by the impact prediction of SARS-CoV-2 variants (5).
The work will be directed by Dr. Toni Giorgino (www.giorginolab.it), computational biophysicist at IBF-CNR, who will support the candidate and assist him/her in defining activities and selecting the most suitable methodologies for the type of data and problem.
A collaboration network linking the Dipartimento di Scienze Biomediche, the co-tutor, IBF-CNR, and the company Dompé will enable the PhD student to integrate into the national and international research environment, both academic and industrial, through active participation in research projects, training schools, courses, congresses and conferences. For example:
- Conferences and mobility supported by COST Action CA18133 ‘ERNEST’ (European Research Network on Signal Transduction);
- International mobility supported by the HPC-EUROPA3 network for high-performance computing;
- The CNR STM (short term mobility) grants;
- Seminars and summer schools organised by CINECA and NVIDIA;
- Zymvol Biomodeling, a company working on the computational optimisation of enzymes;
- Ongoing collaborations with computational and experimental groups at Universitat Pompeu Fabra (Barcelona), University of Verona, University of Girona, SISSA, etc.
The PhD student’s training will provide him/her with the opportunity to apply the scientific skills developed to open research problems from the academic/industrial world.
1. Jiménez J, Doerr S, Martínez-Rosell G, Rose AS, De Fabritiis G. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics. 2017 Oct 1;33(19):3036–42.
2. Giorgino T, Mattioni D, Hassan A, Milani M, Mastrangelo E, Barbiroli A, et al. Nanobody interaction unveils structure, dynamics and proteotoxicity of the Finnish-type amyloidogenic gelsolin variant. Biochim Biophys Acta Mol Basis Dis. 2019 Mar 1;1865(3):648–60.
3. Cossu F, Sorrentino L, Fagnani E, Zaffaroni M, Milani M, Giorgino T, et al. Computational and Experimental Characterization of NF023, A Candidate Anticancer Compound Inhibiting cIAP2/TRAF2 Assembly. J Chem Inf Model. 2020 Sep 4;60(10):5036–44.
4. Nicoli A, Dunkel A, Giorgino T, de Graaf C, Di Pizio A. Classification Model for the Second Extracellular Loop of Class A GPCRs. J Chem Inf Model. 2022 Feb 14;62(3):511–22.
5. Torrens-Fontanals M, Peralta-García A, Talarico C, Guixà-González R, Giorgino T, Selent J. SCoV2-MD: a database for the dynamics of the SARS-CoV-2 proteome and variant impact predictions. Nucleic Acids Res. 2022 Jan 7;50(D1):D858–66.
6. Sheils TK, Mathias SL, Kelleher KJ, Siramshetty VB, Nguyen DT, Bologa CG, et al. TCRD and Pharos 2021: mining the human proteome for disease biology. Nucleic Acids Res. 2021 Jan 8;49(D1):D1334–46.
7. Ferruz N, Höcker B. Dreaming ideal protein structures. Nat Biotechnol. 2022 Feb;40(2):171–2.
8. Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022 Apr 15;38(8):2102–10.
9. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci. 2021 Apr 13;118(15):e2016239118. Available from: https://www.pnas.org/content/118/15/e2016239118
10. The PLUMED consortium. Promoting transparency and reproducibility in enhanced molecular simulations. Nat Methods. 2019 Aug;16(8):670–3.