Structure-Based Function Prediction using Graph Convolutional Networks

Recent massive increases in the number of sequences available in public databases challenges current experimental approaches to determining protein function. These methods are limited by both the large scale of these sequences databases and the diversity of protein functions. We present a deep learning Graph Convolutional Network (GCN) trained on sequence and structural data and evaluate it on ~40k proteins with known structures and functions from the Protein Data Bank (PDB). Our GCN predicts functions more accurately than Convolutional Neural Networks trained on sequence data alone and competing methods. Feature extraction via a language model removes the need for constructing multiple sequence alignments or feature engineering. Our model learns general structure-function relationships by robustly predicting functions of proteins with ≤ 30% sequence identity to the training set. Using class activation mapping, we can automatically identify structural regions at the residue-level that lead to each function prediction for every protein confidently predicted, advancing site-specific function prediction. De-noising inherent in the trained model allows an only minor drop in performance when structure predictions are used, including multiple de novo protocols. We use our method to annotate all proteins in the PDB, making several new confident function predictions spanning both fold and function trees.

Recent massive increases in the number of sequences available in public databases challenges current experimental approaches to determining protein function. These methods are limited by both the large scale of these sequences databases and the diversity of protein functions. We present a deep learning Graph Convolutional Network (GCN) trained on sequence and structural data and evaluate it on~40k proteins with known structures and functions from the Protein Data Bank (PDB). Our GCN predicts functions more accurately than Convolutional Neural Networks trained on sequence data alone and competing methods. Feature extraction via a language model removes the need for constructing multiple sequence alignments or feature engineering. Our model learns general structure-function relationships by robustly predicting functions of proteins with ≤ 30% sequence identity to the training set. Using class activation mapping, we can automatically identify structural regions at the residue-level that lead to each function prediction for every protein confidently predicted, advancing site-specific function prediction. De-noising inherent in the trained model allows an only minor drop in performance when structure predictions are used, including multiple de novo protocols. We use our method to annotate all proteins in the PDB, making several new confident function predictions spanning both fold and function trees.
Proteins are linear chains of amino acid residues that fold into 3-dimensional structures to carry out a wide variety of functions within the cell. Even though many (5-30%, depending on the organism) functional regions of proteins are disordered (lack a well defined ensemble average) the majority of protein domains in natural proteins fold into specific and ordered three-dimensional conformations as a result of the physical interactions within the chain [1][2][3][4][5] . The structural features of proteins, in turn, determine the wide range of functions: from binding specificity, forming structures within the cell, to catalysis of biochemical reactions, transport, or signal transduction. There are several widely used classification schemes that help to organize these myriad protein functions including: the Gene Ontology (GO) Consortium 6 , the Comprehensive Enzyme Information System (BRENDA) 7 , Enzyme Commission (EC) numbers 8

, Kyoto Encyclopedia of Genes and
Genomes (KEGG) 9 , and others. GO, for example, classifies proteins into hierarchically related functional classes (also called GO terms) organized into 3 different ontologies: Molecular Function (MF), Biological Process (BP) and Cellular Component (CC) describing different aspects of protein functions.
The advent of low-cost and efficient protein sequencing technologies has resulted in the massive growth in the number of sequences available in key protein sequence databases, like the UniProt Knowledgebase (UniProtKB, http://uniprot.org). UniProt currently contains over 100 million sequences and only~0.5 million sequences (0.5%) that are manually annotated (UniProtKB/Swiss-Prot). Most proteins with unknown function (i.e., hypothetical proteins) are unlikely to be experimentally characterized. Understanding the functional roles and studying the mechanisms of these newly discovered proteins, in both health and disease, is one of the most important biological problems in the post-genomic era. In parallel to the growth of sequence data, the advent of experimental as well computational techniques in structure biology has made the three-dimensional structures of many proteins available [10][11][12][13][14][15][16] .
The Protein Data Bank (PDB, http://wwpdb.org) remains the main repository of information about the three-dimensional structures of proteins, nucleic acids, and complex assemblies, and has also experienced significant growth in recent years, reaching over 150,000 entries.
To address the sequence-function gap many computational methods have been developed over the years. These methods typically aim to predict protein function for whole protein genes, but much work is also directed at the related problem of predicting function in a site-or domainspecific manner (that automatically generates functional hypothesis linked to residues, regions or domains) [17][18][19][20] . Traditional machine learning classifiers, such as support vector machines, random forests, and high-dimensional statistical methods like logistic regression have been used extensively for the protein function prediction problem, and have established that integrative prediction schemes can outperform homology-based function transfer 21,22 . Systematic benchmarking efforts, such as the Critical Assessment of Functional Annotation (CAFA1 23 & CAFA2 24 ) and MouseFunc 25 , have also played a key role in the development of these methods and have shown that integrative machine learning and statistical methods outperform traditional sequence alignmentbased methods (e.g., BLAST) 22 . However, the performance of these methods is typically strongly affected by the quality of manually-engineered features constructed from either sequence or structure (features that rely heavily on heuristics that in turn require domain-expert knowledge, and in some cases unstable assumptions, thresholds and preprocessing pipelines) 26 . Here, we focus on methods that can take as inputs sequence and features that are readily derived from sequence (such as predicted structure) and do not focus on, or compare to, the many methods that rely on protein networks like GeneMANIA 27 , Mashup 28 , DeepNF 29 , and other integrative network prediction methods. We focus our study in this way to present a method that can be applied to very large volumes of sequence where many proteins are from unknown organisms lacking the required network data (and thus hope to address the critical need for these methods in metagenomic contexts).
In the last decade, deep learning approaches have achieved unprecedented increase in performance on a broad spectrum of problems ranging from learning protein sequence embeddings for contact map prediction 30 to predicting protein structure 31, 32 and function 33 . In particular, Convolutional Neural Networks (CNN) 34 , that have been state-of-the-art in computer vision, have also shown tremendous success in addressing problems in computational biology. They enabled taskspecific feature extraction directly from protein sequence or its 3D structure overcoming the limitations of feature-based ML methods. The majority of sequence-based protein function prediction methods use 1D CNNs, or variations thereof, that search for recurring spatial patterns within a given sequence and converts them hierarchically into complex features using multiple convolutional layers.
Recent work has employed 3D CNNs to make predictions and extract features from protein structural data 35,36 . These methods take as input a 3D volumetric protein structure represented on a grid. Storing explicit 3D representations of protein structure at high resolution is not memory efficient (most of the 3D space is unoccupied by protein structure); thus, in most cases, the 3D CNN would convolve over empty space which is somewhat inefficient. More recently, geometric deep learning methods 37 and more specifically Graph Convolutional Networks (GCNs) 38,39 have offered a way to overcome these limitations by generalizing convolutional operations on more natural graph-like molecular representations. Graph Convolutional Networks have shown tremendous success in various problems ranging from learning useful molecular fingerprints 40 , to predicting biochemical activity of drugs 41 , to protein interface prediction 42 .
Here, we describe a method based on GCNs for functionally annotating protein sequences and structures that outperforms current methods and scales to the size of current repositories of sequence information. We model protein structures as graphs that are derived from protein contact maps (truncated residue-residue-pair distance maps). Residue-level sequence features together with contact maps, are fed into GCNs. The GCNs uses a deep architecture to further propagate residue-level features between residues at different proximity to each other in the protein contact graph to construct final protein-level feature representations that prove useful for protein function prediction. For learning sequence features we use Bidirectional Long Short-Term memory Language Model (LSTM-LM) pretrained on a corpus of around 2 million protein sequences. Our LM is trained to predict an amino-acid residue in the context of residues before and after it in a protein sequences. Using features from a pre-trained, task-agnostic LSTM LM as input to classification tasks has demonstrated tremendous success in many NLP 43 and biological problems 30 . We show that such features can significantly boost performance of GCN in function prediction task. Using LM features together with contact maps of experimental PDB structures we show that our method outperforms sequences-only state-of-the-art methods. Moreover, by testing our method on de novo predicted structures we show that our method is robust to expected errors and can significantly de-noise predicted structures while still confidently predicting their functions.
In addition to improved accuracy of function predictions, our method also provides the abil- Here, we propose a similar approach, adapted for GCNs, for detecting functional regions in proteins. For each PDB chain, CAM detects GO term-specific sites on its 3D structure by identifying residues relevant for making accurate GO term prediction. Here, we show that, for various GO terms, these functional sites often correspond to known binding regions, conserved regions or active sites. Interestingly, our model is not explicitly trained to predict functional sites, but instead such predictions stem solely from the CAM analysis of the graph convolution parameters of the trained model. Performing such analysis for identifying functional sites is also very efficient as it does not require any further training or modification of the model's architecture.
As we demonstrate below, analysing results from CAM approach and finding their biological meanings is challenging for some protein structures. However, in most cases, this approach can automatically navigate the hierarchy from small sites, to larger binding sites, to domains to whole-protein localized functions. The site-specificity afforded by our function predictions is very valuable, especially in the case when predicting functions of poorly studied, unannotated proteins. Site-specific predictions provide first insights into the correctness of predictions and frames follow-up genetics or validation experiments (for example, highlighting the salient residues in the protein's 3D structure, detected by CAM, could serve as a potential validation technique to many biochemists and other domain experts studying predictions made by our model).

Method overview
Our model takes as input a protein sequence and structure (in the form of a contact map) and outputs GO term probabilities. The method consists of two main parts: a LSTM LM that is learned from a very large corpus of protein sequences ( Figure 1A), and a GCN that uses protein structure 2 Evaluating our method on predicted structures Here, we ask how well can our method tolerate the error in predicted structures. We demonstrate this for the Rosetta de novo prediction procedure and for another de novo deep-leaning-based, structure prediction method, 11 . We used Rosetta macromolecular modeling suite 51   We run our method on both predicted (Rosetta de novo) and native (derived from high quality experimental structures) contact maps and report the results together with results of the CNN applied only on sequences in Figure 3. We observe that our GCN model exhibits higher performance than that of the CNN even when accounting for error in predicted contact maps. Even though Rosetta-predicted structures often result in noisy contact maps, the fact that the performance of our method on the predicted LE structures is not drastically impaired can be attributed to the high denoising ability of the GCN implied by high correlation between GCN features extracted from NATIVE and LE contact maps (see Supplementary Figure 7).  Fig. 4). Fig. 4A shows a grad-CAM for a calcium ion binding (GO:0005509) of alpha-parvalbumin protein (PDB id: 1S3P). The two highest peaks in the grad-CAM correspond to the binding regions in the 3D structure of the protein (Fig. 4A, left). Indices of the calcium binding residues of the 1S3P protein were retrieved from the BioLiP database 55 and compared to the residues identified by our method. ROC curve computed between the binary profile representing binding sites from BioLiP (shown in red) and the gradCAM profile (shown in green) in Fig. 4A, right are depicted in Fig. 4B. High area under the ROC curve indicates high correspondence between binding sites and our predictions. Similar correspondence with BioLiP is observed for several other functions including DNA binding (GO:0003677), GTP binding (GO:0005525) and glutathione transferase activity (GO:0004364) (see Fig. 4C and their corresponding ROC curves in Fig. 4 B).
Our systematic analysis of grad-CAMs against BioLiP database reveal that the highest performing group of GO terms are related to functions with known site-specific mechanisms or site specific underpinnings, like metal binding. Therefore, we provide a systematic analysis of grad-CAMs for GO terms related to metal binding using also information from MetalPDB 56 database.
We depict examples (with high AUROC scores) where CAMs correctly identify binding regions for calcium ion binding (GO:0005509), zinc ion binding (GO:0008270) and copper ion binding (GO:0005507) (see Supplementary Figure).
There are also a large number of high quality protein structures in the PDB that lack functional annotation, or that have only high-level or incomplete annotation. This partial lack of annotation results from unbiased structural genomics projects, proteins having associations with processes but no function, and the fact that proteins can have multiple functions. Here we apply our method to the full PDB to 1) annotate unannotated chains, 2) complete partial annotations, and 3) look for new functions hiding in annotated proteins (find more moonlighting proteins). We present Supplementary

Discussion
In this work, we proposed a novel deep learning-based method for predicting protein function from both protein sequences and contact map representations of protein structures. Our method, trained on protein structures from PDB, is very efficient, and it can predict both GO terms and EC numbers of proteins and improves over state-of-the-art sequence-based methods on majority of GO terms especially. Features learned from protein sequences by the LSTM Language Model and from contact maps by the GCN lead to substantial improvements in protein function prediction accuracy, which could enable novel protein function discoveries. One important advantage of our method is that it makes function predictions that go beyond homology-based transfer by extracting local sequence and global structure features that would most likely be neglected by homology-based methods like BLAST (reflected in the substantial difference in the function prediction accuracy between our method and BLAST) 23 .
Comparable performance of our method between Rosetta-predicted and their corresponding experimentally determined structures, which can be attributed to high denoising power of our method, indicates that our method can also be reliably used in predicting functions of proteins with computationally inferred structures. This opens a door for characterizing many proteins lacking experimentally determined 3D structures and the contents of many databases with available predicted structures (e.g., homology-based Swiss-Model 12 , and ModBase 57 could be used for expanding the train set and improving predictive power of the model). The more extensive use of homology models allowed by the denoising properties of our network architecture will be a subject of future study.
While this paper mainly focuses on introducing efficient and accurate function prediction model, it also provides means of interpreting prediction results. We demonstrate, on multiple different GO terms, that CAMs identify structurally-meaningful protein regions encompassing functionally relevant residues (e.g., ligand-binding residues). For some PDB chains, the accuracy at which the CAM identifies binding residues is quite remarkable, especially given the fact the model is not principally designed to predict this, and that the ligand-binding information was not given to the model a priori. However, the main disadvantage of considering this to be a site-specific function prediction method is in the multiple different meanings of CAMs. Specifically, for some GO terms related to "binding", CAMs do not necessarily identify binding residues/regions; instead, they identify regions of residues that are conserved among the sequences annotated with the same function. The most interesting example demonstrating this property is maltose binding (GO:1901982) (see Supplementary Figure 8). In this example, the salient residues are far from the residues binding maltose in the 3D structure; but, by looking at a few non-redundant PDB sequences annotated with maltose binding, we find that the CAM always identifies the same residues that are conserved across the sequences. These can be explained with the fact that any neural network, including ours, would always tend to learn the most trivial features that lead to the highest accuracy ( AUPR=1 for maltose binding for both CNN and GCN) 58,59 . Despite these limitations, with the appropriate balancing and control of the bias in the training set (see Supplementary   Figure 9 showing the distribution of PDB chains belonging to different folds and how this correlates with grad-CAM performance), this approach has a huge potential in advancing site-specific function prediction.
After the culmination of much effort two key problems in computational biology, protein structure prediction and protein function protein, are linked together by the described methods.
Deep learning together with increasing amount of available sequence and structural data being generated each day has a potential to meet the annotation challenges posed by ever increasing volumes of genomic sequence, offering several new methods for interpreting protein biodiversity.

Methods
Construction of contact maps. We collect 3D atomic coordinates of proteins from the Protein Data Bank (PDB) 60 . As the PDB contains extensive redundancy in terms of both sequence and structure, we remove identical and similar sequences from our set of annotated PDB chains. We create a non-redundant set by selecting PDB chains that are not identical to any other PDB chain in the set. To do so, we first cluster all PDB chains (for which we were able to retrieve contact maps) by blastclust at 100% sequence identity (i.e., number of identical residues out of the total number of residues in the sequence alignment). Then, from each cluster we select a "representative" PDB chain as a PDB chain which is annotated (i.e., has at least one GO term in at least one of the 3 ontologies) and which is of high quality (has a high resolution structure). Each protein in the set is described by an ordered list of amino acid residues represented by thier X, Y and Z coordinates in angstrom (Å). To construct contact maps we use the α-carbon (Cα) atom type and consider two resides to be in contact if the distance between their corresponding Cα atoms is less than 10Å. We refer to this type of contact maps as Cα-Cα. We have also considered two other criteria for contact map construction. Two residues are in contact: 1) if the distance between any of their atoms is less than 6.5Å (we refer to this type of contact maps as ANY-ANY) and 2) if the distance between their Rosetta neighbor atoms is less than sum of the neighbor radii of the amino acid pair (we refer to this type of contact maps as NBR-NBR). Rosetta neighbor atoms are defined as the β-carbon (Cβ) for all amino acids except glycine where the α-carbon is used. An amino acids neighbor-radius describes a potential interaction sphere that would be swept out by the side amino acid side-chain as it samples all possible conformations. Neighbor-neighbor contact maps are therefore more indicative of side-chain-side-chain interactions than Cα-Cα maps. We have also experimented with different cut-off thresholds for Cα-Cα and AN Y − AN Y contact maps.
We found that our method produced the best results with Cα-Cα and 10Å cut-off.
Function annotations of PDB chains. In the training of our models we use two sets of function labels: 1) Gene Ontology (GO) 6 terms and 2) enzyme commission (EC) numbers 7 . GO terms are hierarchically organized into 3 different onotologies -molecular function (MF), biological process (BP) and cellular component (CC). We train our models to predict GO terms separately for each ontology. The summary of GO identifiers as well as EC numbers for each PDB chain were retrieved from SIFTS 61 (Structure integration with function, taxonomy and sequence) database. SIFTS transfers annotation to PDB chain level via residue-level mapping between UniProt Knowledgebase (UniProtKB) and PDB entries. All the annotation files were retrieved from SIFTS database with PDB release 08. 19 and UniPort release 2019.02. We consider annotations that are: 1) not electronically inferred (non-IEA), specifically, we consider GO terms with the following evidence codes: EXP, IDA, IPI, IMP, IGI, IEP, TAS and IC and 2) electronically inferred (IEA). Further-more, we focus only on specific MF-, BP-and CC-GO terms that have enough training examples from the non-redundant training set (see the section above). That is, we select only GO terms that annotate ≥ 30 (for MF and CC) and ≥ 50 (for BP) non-redundant PDB chains. We retrieved enzyme classes for sequences and PDB structures from the lowest level (most specific level) of EC tree. The number of GO terms and EC classes in each ontology is represented in Table [Supple-mentary Table 1].
Top 500 Rosetta-predicted structures. The initial set of benchmark structures used here was Jane and Dave Richardson's "Top 500" dataset 62 . It is a set of hand curated, high quality (the top 500 best), protein structures that were chosen for their fit to their completeness, how well they fit the experimental data, and lack of high energy structural outliers (bond angle and bond length deviations). This set has been used in the past for fitting Rosetta energy/score terms and numerous  Graph convolutional network. Graph Convolutional Networks (GCNs) have recently been shown to be powerful methods for extracting features from data that is naturally represented as one or more graphs 37 . This makes GCN a suitable candidate method for extracting features from a protein by taking into account their graph-based structure of amino acids represented by contact maps.
In particular, they have achieved a remarkable performance in classifying documents in citation networks 39 , modeling and predicting chemical properties of molecules 40,41,67 and protein interface prediction with applications in drug discovery and design 42 . Here, we propose our model based on the work of Kipf & Welling 39 . A protein graph can be represented by an adjacency matrix (also termed contact map), A ∈ R L×L , encoding connections between its L residues, and a residue-level feature matrix, X ∈ R L×c . We explore different residue-level feature representations including the one-hot encoding representation of residues as in the CNN (c = 26), LSTM language model (c = 512, i.e., the concatenated output forward and reverse LSTM layers), and no sequence features. 1 We refer to the last case as function prediction from protein fold only; see [Supplementary Figure 1].
The graph convolution takes both adjacency matrix, A and residue-level embeddings from the previous layer, H (l) ∈ R L×c l and outputs the residue-level embeddings in the next layer, H (l+1) ∈ R n×c l+1 : where H (0) = X, and c l and c l+1 are residue embedding dimensions for layers l and l + 1, respectively. Concretely, we use the formulation of Kipf & Welling 39 : where A = A + I L is the adjacency matrix with added self-connections represented by the identity matrix I L ∈ R L×L ; D is the diagonal degree matrix with entries D ii = L j=1 A ij , and W (l) ∈ R c l ×c l+1 is a trainable weight matrix for layer l + 1. To keep the residues' features on the same scale after every convolutional layer the adjacency matrix is first symmetrically normalized, hence We then use two dense layers to learn complex protein-to-function relations with ReLU activation function in the first layer and sigmoid (for predicting GO terms) and softmax (for predicting EC) activation function in the second layer. The second layer outputs probability vectorŷ of dimension |GO| for predicting probabilities of GO terms, and |EC| for predicting probabilities of EC classes.
Model training and hyperparameters. To account for imbalanced label problem, both CNN and GCN are trained to minimize weighted binary cross-entropy cost function that gives higher weights to GO term with fewer training examples: where Θ is the set of all parameters in all layers to be learned; w j = N  set is chosen to be no more than 30% sequence identical to the training set (and typically much less, or unalignable). We perform experiments with different thresholds. See Figure 2b showing performance of our method for different sequence identity thresholds. In all our experiments we trained on both non-IEA and IEA PDB chains (see Supplementary Figure 1), but the the test set, composed of only experimentally annotated PDB chains (non-IEA), is always kept fixed. See In all our experiments we train different models and the final results are averaged over predictions made by the 10 different models.
Residue-level annotations. We use a method based on Gradient-weighted Class Activation Map (grad-CAM) 44 to localize function predictions on a protein structure (i.e., to find residues with highest contribution to a specific function). grad-CAM is a class-discriminative localization technique that provides visual explanations for predictions made by CNN-based models. Motivated by its huge success in image analysis, we use grad-CAM to identify important, function-specific residues in a protein structure. In a grad-CAM approach, we first compute the contribution of each filter, k, in the last convolutional layer to the prediction of function label l by taking derivative of the output of the model for function l, y l , with respect to feature map F k ∈ R L over the whole sequence of length L: where w l k represent the importance of feature map k for predicting function l, obtained by summing the contributions from each individual residue. Finally, we obtain the function-specific heat-map in a residue space by making the weighted sum over all feature maps in the last convolutional layer: where ReLU function ensures that only features with positive influence on the functional label are preserved; CAM l [i] -indicates the relative importance of residue i to function l.
To account for variations in grad-CAM between different initializations of the same model architecture, we report the grad-CAMs averaged over an ensemble of 10 different separately trained models.
The advantage of grad-CAM is that it does not require re-training or changes in the architecture of the model which makes is computationally efficient and directly applicable to our models. Competing Interests The authors declare that they have no competing financial interests.
Correspondence Correspondence and requests for materials should be addressed to: vgligorijevic@flatironinstitute.org, rb133@nyu.edu