Hydrogen bonds meet self-attention: all you need for general-purpose protein structure embedding
Methods
- A neural network architecture is proposed to encode protein 3D structures from both local and global information.
- First, all pairs of residues connected by hydrogen bonds are extracted. For each such contact, a $k$-residue window along the backbone around each of the two residues is included. An MLP then encodes the corresponding carbon-alpha distance matrix into a fixed-size embedding (see the first sketch after this list).
- Once an embedding for each hydrogen bond has been computed, a transformer with all-to-all attention pools the bond embeddings into a single global protein embedding (pooling sketch below).
- The model is trained end-to-end on SCOP protein classification; that is, the embeddings are trained to predict the structural family of a given protein (training sketch below).
- The authors show state-of-the-art performance on SCOP classification, and that embedding similarity correlates well with TM-score when the embeddings are used to retrieve structurally similar proteins (retrieval sketch below).
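
A minimal sketch of the per-bond local encoder described above. It assumes the hydrogen-bonded residue pairs have already been detected (e.g. with DSSP); the window size $k$, layer widths, and embedding dimension are illustrative placeholders rather than the paper's values.

```python
import torch
import torch.nn as nn

def local_distance_matrix(ca_coords, i, j, k=2):
    """Flattened pairwise C-alpha distances over the k-windows around residues i and j."""
    L = ca_coords.shape[0]
    idx_i = torch.arange(i - k, i + k + 1).clamp(0, L - 1)
    idx_j = torch.arange(j - k, j + k + 1).clamp(0, L - 1)
    window = ca_coords[torch.cat([idx_i, idx_j])]   # (2*(2k+1), 3) coordinates
    return torch.cdist(window, window).flatten()    # pairwise distance matrix, flattened

class BondEncoder(nn.Module):
    """MLP mapping a flattened local distance matrix to a fixed-size bond embedding."""
    def __init__(self, k=2, d_embed=128):
        super().__init__()
        n = 2 * (2 * k + 1)                          # residues covered by the two windows
        self.mlp = nn.Sequential(nn.Linear(n * n, 256), nn.ReLU(), nn.Linear(256, d_embed))

    def forward(self, flat_dists):                   # (num_bonds, n*n)
        return self.mlp(flat_dists)                  # (num_bonds, d_embed)

# Toy usage: encode two hypothetical hydrogen bonds of a 50-residue chain.
coords = torch.randn(50, 3)
hbond_pairs = [(3, 30), (10, 42)]
feats = torch.stack([local_distance_matrix(coords, i, j) for i, j in hbond_pairs])
bond_embeddings = BondEncoder()(feats)               # (2, 128)
```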
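
The pooling stage could look roughly like the following; whether the transformer outputs are averaged or pooled through a dedicated token is not stated here, so mean pooling and the layer/head counts are assumptions.

```python
import torch
import torch.nn as nn

class BondSetPooler(nn.Module):
    """All-to-all self-attention over the set of bond embeddings, mean-pooled into one vector."""
    def __init__(self, d_embed=128, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_embed, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, bond_embeddings):               # (batch, num_bonds, d_embed)
        attended = self.encoder(bond_embeddings)      # unordered set, no positional encoding
        return attended.mean(dim=1)                   # (batch, d_embed) global protein embedding

protein_embedding = BondSetPooler()(torch.randn(1, 20, 128))   # e.g. a protein with 20 H-bonds
```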
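
The SCOP supervision amounts to a classification head over the global embedding trained with cross-entropy. The class count and optimizer settings below are placeholders, and in the actual end-to-end setup the optimizer would also cover the bond encoder and pooler.

```python
import torch
import torch.nn as nn

d_embed, num_scop_classes = 128, 2000                 # placeholder number of SCOP labels
head = nn.Linear(d_embed, num_scop_classes)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

protein_embeddings = torch.randn(8, d_embed)          # stand-in for a batch of pooled embeddings
labels = torch.randint(0, num_scop_classes, (8,))     # SCOP class indices

loss = nn.functional.cross_entropy(head(protein_embeddings), labels)
loss.backward()
optimizer.step()
```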
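
For the retrieval experiment, ranking by a simple similarity over the embeddings is the natural reading; cosine similarity is assumed here and may differ from the score the authors actually use.

```python
import torch
import torch.nn.functional as F

query = torch.randn(128)                    # embedding of the query structure
database = torch.randn(1000, 128)           # embeddings of the reference structures

scores = F.cosine_similarity(query.unsqueeze(0), database)   # (1000,) similarity to the query
top_hits = scores.topk(10).indices          # indices of the 10 most similar structures
```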
Comments
- The model still requires external supervision (SCOP labels) to work, so the claim that this is a 'general-purpose' embedding is somewhat misleading.
- The transformer attends to all pairs of hydrogen-bond embeddings; it could probably exploit sparsity, e.g. by restricting attention to spatially nearby bonds (illustrated below).
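
One way the sparsity suggestion could be realized (this is not something the paper does): mask attention between hydrogen bonds whose midpoints are further apart than some cutoff, so each bond only attends to spatially nearby bonds.

```python
import torch
import torch.nn as nn

bond_midpoints = torch.randn(20, 3) * 10     # 3D midpoint of each hydrogen bond (toy values)
cutoff = 12.0                                 # arbitrary distance cutoff in Angstroms
dist = torch.cdist(bond_midpoints, bond_midpoints)
attn_mask = dist > cutoff                     # True = attention between these bonds is blocked

layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(torch.randn(1, 20, 128), mask=attn_mask)   # local instead of all-to-all attention
```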