Research: Machine Learning for Bioinformatics

My research efforts in machine learning for bioinformatics are focused on

  • developing and applying computational techniques for the analysis of biological data and
  • modeling biological processes at the molecular level.

The broad aim is to provide computational tools that help researchers understand, explain, and predict the behavior of complex biological systems. My research activities take place in the Cancer Systems Biology Laboratory.

Our umbrella project is Comprehensive Resource of Biomedical Relations with Deep Learning and Knowledge Graph Representations (CROssBAR) [WebPage]

ACTIVE RESEARCH PROJECTS

Drug-Target Interaction and Affinity Prediction
DEEPScreen is a high-performance drug–target interaction predictor that uses convolutional neural networks and 2-D structural representations of compounds to predict the activity of those compounds against intended target proteins. The DEEPScreen system is composed of 704 target protein-specific prediction models, each trained independently on experimental bioactivity measurements for many drug candidate small molecules and optimized according to the binding properties of its target protein.
MDeePred is a deep-learning method that produces compound-target binding affinity predictions to be used for the purposes of computational drug discovery and repositioning. The method adopts the chemogenomic approach, where both the compound and target protein features are employed at the input level to model their interaction, which enables the prediction of inhibitors to under-studied or completely non-targeted proteins. In MDeePred, multiple types of protein features such as sequence, structural, evolutionary, and physicochemical properties are incorporated within multi-channel 2-D vectors, which are then fed to state-of-the-art pairwise input hybrid deep neural networks to predict the real-valued compound-target protein interactions.
We are currently investigating the use of transfer learning, low-shot learning, and single-shot learning for drug-target interaction prediction.
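
The multi-channel protein featurization described for MDeePred can be pictured with a small sketch. It assumes that each channel is a square matrix whose entry (i, j) is a score looked up for the residue pair at positions i and j, and that sequences are truncated or zero-padded to a fixed size. The score tables, the function names `pair_score` and `featurize`, and the padding rule are illustrative assumptions, not the MDeePred implementation.

```python
from typing import Dict, List, Tuple

# A channel is a table of residue-pair scores. The values below are made up;
# real channels would come from substitution or physicochemical matrices.
Channel = Dict[Tuple[str, str], float]


def pair_score(table: Channel, a: str, b: str) -> float:
    """Look up a symmetric residue-pair score, defaulting to 0.0."""
    return table.get((a, b), table.get((b, a), 0.0))


def featurize(sequence: str, channels: List[Channel], size: int) -> List[List[List[float]]]:
    """Return a [channel][row][col] tensor with size x size matrices.

    Entry (i, j) of a channel holds the score of residues i and j.
    Sequences longer than `size` are truncated; shorter ones are zero-padded.
    """
    seq = sequence[:size]
    tensor = []
    for table in channels:
        matrix = [[0.0] * size for _ in range(size)]
        for i, a in enumerate(seq):
            for j, b in enumerate(seq):
                matrix[i][j] = pair_score(table, a, b)
        tensor.append(matrix)
    return tensor


# Two toy channels over a three-letter alphabet (hypothetical values).
identity = {("A", "A"): 1.0, ("C", "C"): 1.0, ("G", "G"): 1.0}
toy_similarity = {("A", "A"): 1.0, ("C", "C"): 1.0, ("G", "G"): 1.0, ("A", "G"): 0.5}

features = featurize("ACG", [identity, toy_similarity], size=4)
print(len(features), len(features[0]), len(features[0][0]))
```

A tensor of this shape is the kind of input a convolutional branch could consume, while the compound would be encoded separately and the two branches joined in a pairwise-input network.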

Software
MDeePred [GitHub] pairwise input deep neural network regressor for drug-target affinity prediction for virtual screening
iBioProVis [WebService] visualization of compounds on 2D space in the context of their cognate targets
DEEPScreen [GitHub] virtual screening with deep convolutional neural networks using compound images

Publications
– MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery, Bioinformatics btaa858 [article]
– DEEPScreen: high performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations, Chemical Science 11:2531-2557 [article]
– iBioProVis: interactive visualization and analysis of compound bioactivity space, Bioinformatics 36:4227–4230 [article]

Automated Protein Function Prediction
Automated protein function prediction is critical for the annotation of uncharacterized protein sequences, where accurate prediction methods are still required. Recently, deep learning-based methods have outperformed conventional algorithms in computer vision and natural language processing, owing to techniques that prevent overfitting and make training efficient. Here, we propose DEEPred, a hierarchical stack of multi-task feed-forward deep neural networks, as a solution to Gene Ontology (GO) based protein function prediction. DEEPred was optimized through rigorous hyper-parameter tests and benchmarked using three types of protein descriptors, training datasets of varying sizes, and GO terms from different levels. Furthermore, to explore how training with larger but potentially noisy data would change the performance, electronically made GO annotations were also included in the training process. The overall predictive performance of DEEPred was assessed using the CAFA2 and CAFA3 challenge datasets, in comparison with state-of-the-art protein function prediction methods. Finally, we evaluated selected novel annotations produced by DEEPred with a literature-based case study considering the ‘biofilm formation process’ in Pseudomonas aeruginosa. This study reports that deep learning algorithms have significant potential in protein function prediction, particularly when the source data is large. The neural network architecture of DEEPred can also be applied to the prediction of other types of ontological associations.
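
One way to picture how GO terms from different levels could be assigned to multi-task models is sketched below. It is an illustration under assumptions: a term's level is taken as its shortest distance to the root, terms with too few annotations are dropped, and the remaining terms at each level are split into fixed-size groups, one group per multi-task model. The term names, the annotation counts, the threshold, and the group size are all hypothetical and are not taken from DEEPred.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def term_level(term: str, parents: Dict[str, List[str]], cache: Dict[str, int]) -> int:
    """Shortest distance from a term to the root of the ontology DAG."""
    if term not in cache:
        direct = parents.get(term, [])
        cache[term] = 0 if not direct else 1 + min(
            term_level(p, parents, cache) for p in direct
        )
    return cache[term]


def group_terms(
    parents: Dict[str, List[str]],
    annotation_counts: Dict[str, int],
    min_annotations: int,
    group_size: int,
) -> List[Tuple[int, List[str]]]:
    """Group sufficiently annotated terms by level, then chunk each level."""
    cache: Dict[str, int] = {}
    by_level: Dict[int, List[str]] = defaultdict(list)
    for term, count in annotation_counts.items():
        if count >= min_annotations:
            by_level[term_level(term, parents, cache)].append(term)

    groups = []
    for level in sorted(by_level):
        # Most-annotated terms first; ties broken by name for determinism.
        terms = sorted(by_level[level], key=lambda t: (-annotation_counts[t], t))
        for start in range(0, len(terms), group_size):
            groups.append((level, terms[start:start + group_size]))
    return groups


# Toy ontology: each term maps to its direct parents (made-up identifiers).
parents = {
    "root": [],
    "term_a": ["root"],
    "term_b": ["root"],
    "term_c": ["term_a"],
    "term_d": ["term_a", "term_b"],
    "term_e": ["term_c"],
}
counts = {"term_a": 500, "term_b": 300, "term_c": 120, "term_d": 80, "term_e": 10}

task_groups = group_terms(parents, counts, min_annotations=50, group_size=2)
for level, terms in task_groups:
    print(level, terms)
```

Each resulting group would correspond to one multi-task network whose output units are the terms in that group.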

Publication
– A.S. Rifaioglu, T. Doğan, M.J. Martin, R. Cetin-Atalay, V. Atalay (2019). DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks, Scientific Reports, 9(1), 1-16, https://doi.org/10.1038/s41598-019-43708-3

Software
DEEPred [GitHub] automated function prediction

PREVIOUS RESEARCH PROJECTS

Genome Annotation based on Sequence Analysis
For the analysis of sequences, we have developed a generative system based on feature space mapping, called Subsequence Profile Map (SPMap). The feature vectors produced by SPMap can be fed to a discriminative classifier for prediction purposes. Instead of focusing on motifs in a sequence, SPMap considers all of the subsequences as a distribution over a quantized space by discretizing and reducing the dimension of an otherwise huge space of all possible subsequences. We have applied SPMap and other feature generation methods to the Automated Protein Function Prediction and Enzyme Class Prediction problems.
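
The idea of treating all subsequences as a distribution over a quantized space can be illustrated with a deliberately simplified sketch. It assumes fixed-length subsequences, a small set of prototype subsequences standing in for the quantized space, and Hamming distance for assignment. The prototypes and the function names are hypothetical, and SPMap itself may differ in how the space is built and how subsequences are scored.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def subsequences(sequence: str, k: int) -> List[str]:
    """All overlapping subsequences of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def profile_vector(sequence: str, prototypes: List[str], k: int) -> List[float]:
    """Map a sequence to a normalized histogram over prototype subsequences.

    Each subsequence is assigned to its nearest prototype (first one wins
    on ties), so a sequence of any length becomes a fixed-size vector.
    """
    subs = subsequences(sequence, k)
    counts = [0] * len(prototypes)
    for sub in subs:
        nearest = min(range(len(prototypes)), key=lambda i: hamming(sub, prototypes[i]))
        counts[nearest] += 1
    if not subs:
        return [0.0] * len(prototypes)
    return [c / len(subs) for c in counts]


# Toy prototypes standing in for learned cluster centers (assumption).
prototypes = ["ACD", "CDA", "DDD"]
vector = profile_vector("ACDACD", prototypes, k=3)
print(vector)
```

The resulting fixed-length vector is what a discriminative classifier would take as input, regardless of the original sequence length.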
Software
UniGOPred [WebService] automated protein function prediction tool based on Gene Ontology (GO) terms and a database of GO term predictions for UniProtKB
ECPred [WebService] [GitHub] Enzyme Commission (EC) number prediction
Publications
– A.S. Rifaioglu, T. Doğan, Ö.S. Saraç, T. Ersahin, R. Saidi, V. Atalay, M.J. Martin, R. Cetin-Atalay, “Large-scale automated multi-functional annotation of protein sequences and an experimental case study validation on PTEN transcript variants”, Proteins, 00:1–17, 2017, https://doi.org/10.1002/prot.25416
– A. Dalkiran, A.S. Rifaioglu, M.J. Martin, R. Cetin-Atalay, V. Atalay, T. Doğan, “ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature”, BMC Bioinformatics, 19:334, 2018, https://doi.org/10.1186/s12859-018-2368-y

Generating Readable Layouts for Biological Graphs
Although force-directed layout algorithms can be used to draw biological graphs, they need modification when domain-specific knowledge is to be embedded. We proposed EClerize, a modified and improved Kamada-Kawai force-directed layout algorithm, to generate more readable layouts for biological graphs that represent pathways in which the vertices are identified with EC (Enzyme Commission) numbers. Vertices with the same EC class number are treated as members of the same cluster, and the positions of vertices within a cluster are affected by the biological similarity of the vertices in that cluster and the theoretical length between them. [Cytoscape]
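
A Kamada-Kawai style layout places vertices so that their drawn distance approaches an ideal length derived from graph-theoretic distance. The sketch below shows one hypothetical way that EC information could enter that ideal length: the shortest-path distance is shrunk according to how many leading EC fields two vertices share. The shrink rule, the `shrink` parameter, and the function names are assumptions made for illustration and are not the EClerize formulation.

```python
from collections import deque
from typing import Dict, List


def shortest_paths(adjacency: Dict[str, List[str]], source: str) -> Dict[str, int]:
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def shared_ec_fields(ec_a: str, ec_b: str) -> int:
    """Count leading EC number fields that two vertices have in common."""
    shared = 0
    for x, y in zip(ec_a.split("."), ec_b.split(".")):
        if x != y:
            break
        shared += 1
    return shared


def ideal_length(graph_dist: int, ec_a: str, ec_b: str, shrink: float = 0.2) -> float:
    """Graph distance, shortened for vertices with similar EC numbers.

    With four EC fields and shrink=0.2 the factor stays positive (>= 0.2).
    """
    return graph_dist * (1.0 - shrink * shared_ec_fields(ec_a, ec_b))


# Toy pathway: a three-vertex path, each vertex labeled with an EC number.
adjacency = {"n1": ["n2"], "n2": ["n1", "n3"], "n3": ["n2"]}
ec = {"n1": "1.1.1.1", "n2": "2.7.1.1", "n3": "1.1.1.2"}

dist_from_n1 = shortest_paths(adjacency, "n1")
length_n1_n2 = ideal_length(dist_from_n1["n2"], ec["n1"], ec["n2"])
length_n1_n3 = ideal_length(dist_from_n1["n3"], ec["n1"], ec["n3"])
print(length_n1_n2, length_n1_n3)
```

In this toy case n1 and n3 are two edges apart but share three EC fields, so their ideal length ends up shorter than that of the adjacent but dissimilar pair n1 and n2, which pulls same-class vertices into a cluster.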
Publication
– H.F. Danaci, R. Cetin-Atalay, V. Atalay, “EClerize: A customized force-directed graph drawing algorithm for biological graphs with EC attributes”, J Bioinform Comput Biol., 16(4):1850007, doi:10.1142/S0219720018500075

Evaluating the Biological Activity of Genes and Processes in Pathways
As an alternative to existing functional enrichment methods, which identify significant biological processes and pathways from experimental data, we have proposed and developed an approach that assesses the activity of cellular pathways. The approach converts a pathway to a directed graph and applies a score flow algorithm that initializes the scores of pathway nodes from experimental data and then iteratively updates them until convergence is reached. The algorithm has been implemented as a Cytoscape plug-in, Pathway Scoring Application, and tested on different sets of paired transcriptome/ChIP-seq data using KEGG pathways. It has been further tested as an in silico gene knockout tool on a manually constructed pathway. Our current effort is on developing a probabilistic computational method for this approach.
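
The initialize-then-iterate structure of a score flow computation can be sketched as follows. This is a simplified propagation scheme, not the published algorithm: each node keeps its initial, data-derived score and additionally receives a damped share of the scores of its upstream nodes, and iteration stops when no score changes by more than a tolerance. The damping factor, the equal split across outgoing edges, and the node names are assumptions.

```python
from typing import Dict, List, Tuple


def score_flow(
    edges: List[Tuple[str, str]],
    initial: Dict[str, float],
    damping: float = 0.5,
    tol: float = 1e-9,
    max_iter: int = 1000,
) -> Dict[str, float]:
    """Propagate node scores along directed edges until convergence.

    Every node named in `edges` must have an entry in `initial`.
    A damping factor below 1 keeps scores bounded on cyclic pathways.
    """
    out_degree = {node: 0 for node in initial}
    predecessors: Dict[str, List[str]] = {node: [] for node in initial}
    for src, dst in edges:
        out_degree[src] += 1
        predecessors[dst].append(src)

    scores = dict(initial)
    for _ in range(max_iter):
        updated = {
            node: initial[node]
            + damping * sum(scores[p] / out_degree[p] for p in predecessors[node])
            for node in initial
        }
        delta = max(abs(updated[n] - scores[n]) for n in initial)
        scores = updated
        if delta < tol:
            break
    return scores


# Toy cyclic pathway: receptor -> kinase <-> transcription factor.
edges = [("receptor", "kinase"), ("kinase", "tf"), ("tf", "kinase")]
initial_scores = {"receptor": 1.0, "kinase": 0.0, "tf": 0.0}

final_scores = score_flow(edges, initial_scores)
print(final_scores)
```

Setting a node's initial score to zero and removing its outgoing edges before running the same function gives a rough picture of how such a scheme could be used for in silico knockout experiments.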
Publications
– Z. Isik, T. Ersahin, V. Atalay, C. Aykanat, R. Cetin-Atalay, “A signal transduction score flow algorithm for cyclic cellular pathway analysis, which combines transcriptome and ChIP-seq data”, Molecular BioSystems, 8, p.3224-3231, 2012, doi:10.1039/C2MB25215E.
– Z. Işık, V. Atalay, R. Çetin-Atalay, “Evaluation of Signaling Cascades Based on the Weights from Microarray and ChIP-seq Data”, Journal of Machine Learning Research W&C Proceedings, MIT Press, Vol. 8, pp. 44-54, 2010.
– Z. Isik, V. Atalay, R. Cetin-Atalay, “Evaluation of Signaling Cascades Based on the Weights from Microarray and ChIP-seq Data”, Third International Workshop on Machine Learning in Systems Biology (MLSB 2009), Ljubljana, Slovenia, September 5-6, 2009.

Identification of Novel Reference Genes Based on MeSH Categories
Even the most frequently used reference genes are subject to differential regulation under specific treatments or between different cell lines or tissues. We have devised a method that provides alternative reference gene lists for global and cell-type-specific normalization of transcriptome data. Gene lists are scored based on their expression stability and classified according to the Medical Subject Headings (MeSH) associated with the published transcriptome study, as indexed by the National Library of Medicine.
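
Scoring genes by expression stability can be illustrated with a minimal sketch. It assumes the coefficient of variation (standard deviation divided by mean) across samples as the stability measure, where lower values mean more stable expression. The measure, the function name, and the toy expression values are assumptions, since the stability score used by the method is not specified here.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def stability_ranking(expression: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank genes from most to least stable by coefficient of variation.

    Genes with a non-positive mean are skipped because the ratio is
    undefined or meaningless for them.
    """
    scores = {}
    for gene, values in expression.items():
        mu = mean(values)
        if mu > 0:
            scores[gene] = pstdev(values) / mu
    return sorted(scores.items(), key=lambda item: item[1])


# Toy expression values for three genes across four samples (made up).
expression = {
    "gene_a": [10.0, 10.0, 10.0, 10.0],
    "gene_b": [5.0, 15.0, 5.0, 15.0],
    "gene_c": [9.0, 10.0, 11.0, 10.0],
}

ranking = stability_ranking(expression)
for gene, cv in ranking:
    print(gene, round(cv, 3))
```

Running the same ranking separately on the samples that belong to each MeSH category would yield category-specific candidate reference gene lists alongside a global one.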
Publication
– L. Çarkacıoğlu, R. Çetin-Atalay, Ö. Konu, V. Atalay, T. Can, “Bi-k-Bi Clustering: Mining Large Scale Gene Expression Data Using Two-Level Biclustering”, International Journal of Data Mining and Bioinformatics, Inderscience, Vol. 4, No. 6, pp. 701-721, 2010, doi:10.1504/IJDMB.2010.037548.