Structural Motif Search

The biochemical and biological functions of proteins arise from both the overall fold of the polypeptide chain and, typically, structural motifs: small groups of amino acids that form a catalytic center or a binding site and that may be far apart from one another in the amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins.

Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. We have developed a new approach that uses an inverted index strategy capable of analyzing >200,000 PDB structures with unmatched speed. The efficiency of our inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user.
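The idea behind an inverted index can be illustrated with a minimal Java sketch. It is not the library's implementation: the class `ToyInvertedIndex`, its methods, and the plain integer descriptor keys are hypothetical, and this toy version stores only structure ids per descriptor. It merely shows why most structures are never touched: a query intersects the posting lists of its descriptors, and everything outside those lists is ignored.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: maps a descriptor key to the ids of all structures that contain it.
final class ToyInvertedIndex {
    private final Map<Integer, Set<String>> postings = new HashMap<>();

    // Register that a structure contains a given descriptor.
    void add(int descriptor, String structureId) {
        postings.computeIfAbsent(descriptor, k -> new TreeSet<>()).add(structureId);
    }

    // Return the structures that contain every descriptor of the query.
    Set<String> candidates(List<Integer> queryDescriptors) {
        Set<String> result = null;
        for (int descriptor : queryDescriptors) {
            Set<String> hits = postings.getOrDefault(descriptor, Set.of());
            if (result == null) {
                result = new TreeSet<>(hits); // first posting list seeds the result
            } else {
                result.retainAll(hits); // intersect with each further posting list
            }
            if (result.isEmpty()) {
                break; // no structure can match anymore
            }
        }
        return result == null ? Set.of() : result;
    }
}
```

Only the surviving candidates would then need to be aligned and scored, which is the expensive part of a search.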

See it in Action

Structural motif searching is available as part of the RCSB Advanced Search and RCSB Mol* plugin. Help documentation is available.

Performance

Current benchmark times for searching 208,702 PDB structures and 1,068,577 AlphaFold/RoseTTAFold predictions (as of 8/16/23), obtained on an instance with 6 cores and 64 GB of memory. All structure data is held in memory; inverted index data is read from an SSD.

Allowing only Experimental/Archived Structures

Motif	Definition	Found Assemblies	'Paths' Time [ms]	'Score' Time [ms]	Total Time [ms]
Serine Protease	4cha - His:B-42, Asp:B-87, Ser:C-47	5,309	618	22	673
Aminopeptidase	1lap - Lys:A-250, Asp:A-255, Asp:A-273, Asp:A-332, Glu:A-334	91	158	1	181
Zinc Fingers	1g2f - Cys:F-7, His:F-25, His:F-29	739	135	3	160
Enolase Superfamily	2mnr - Lys:A-162, Asp:A-193, Glu:A-219, Glu:A-245, His:A-295	192	253	2	275
Enolase Superfamily (exchanges)	2mnr - Lys/His:A-162, Asp:A-193, Glu:A-219, Glu/Asp/Asn:A-245, His/Lys:A-295	210	2,996	14	3,032
RNA G-Quadruplex	3ibk - G:A-4, G:A-10, G:B-4, G:B-10	85	2,364	236	2,622

Including Computed Structure Models

Motif	Definition	Found Assemblies	'Paths' Time [ms]	'Score' Time [ms]	Total Time [ms]
Serine Protease	4cha - His:B-42, Asp:B-87, Ser:C-47	10,254	1,710	125	1,988
Aminopeptidase	1lap - Lys:A-250, Asp:A-255, Asp:A-273, Asp:A-332, Glu:A-334	647	352	7	389
Zinc Fingers	1g2f - Cys:F-7, His:F-25, His:F-29	9,442	492	92	686
Enolase Superfamily	2mnr - Lys:A-162, Asp:A-193, Glu:A-219, Glu:A-245, His:A-295	328	659	5	689
Enolase Superfamily (exchanges)	2mnr - Lys/His:A-162, Asp:A-193, Glu:A-219, Glu/Asp/Asn:A-245, His/Lys:A-295	350	5,246	25	5,296
RNA G-Quadruplex	3ibk - G:A-4, G:A-10, G:B-4, G:B-10	85	2,453	253	2,742

Each benchmark searches for all assemblies that contain hits with an RMSD <2 Å. 'Paths' refers to the time spent on inverted index operations, which identify all candidate structures that contain the motif. 'Score' refers to the time spent on aligning candidate structures to the query and computing RMSD values.
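As a minimal sketch of the measure reported by the 'Score' step, the following Java snippet computes an RMSD and applies a cutoff. It assumes that the two coordinate sets are already superposed and listed in matching order; the alignment itself is omitted. The class and method names are hypothetical and not part of the library.

```java
// Hypothetical helper: RMSD between two equally sized, already superposed coordinate sets.
final class RmsdSketch {
    // Each point is {x, y, z} in Å; points at the same array index are treated as equivalent.
    static double rmsd(double[][] reference, double[][] candidate) {
        if (reference.length != candidate.length || reference.length == 0) {
            throw new IllegalArgumentException("coordinate sets must be non-empty and of equal size");
        }
        double sumOfSquares = 0;
        for (int i = 0; i < reference.length; i++) {
            for (int d = 0; d < 3; d++) {
                double delta = reference[i][d] - candidate[i][d];
                sumOfSquares += delta * delta;
            }
        }
        // root of the mean squared distance over all points
        return Math.sqrt(sumOfSquares / reference.length);
    }

    // A hit is kept when its RMSD is below the cutoff (2 Å in the benchmark above).
    static boolean isHit(double[][] reference, double[][] candidate, double cutoff) {
        return rmsd(reference, candidate) < cutoff;
    }
}
```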

For computed structure models, unreliable regions with pLDDT <70 are ignored.

Features

nucleotide support
inter-chain & assembly support
position-specific exchanges
modified residues
support for computed structure models, e.g., from AlphaFold
detect motifs in a structure of interest

Getting Started with a Dependency

strucmotif-search is distributed via Maven and supports Java 11+. To get started, add the following dependency to your pom.xml:

<dependency>
  <groupId>org.rcsb</groupId>
  <artifactId>strucmotif-search</artifactId>
  <version>0.19.6</version>
</dependency>

Getting Started by Cloning

An alternative way to use the library is cloning this repository and building the corresponding Maven modules.

Search for Similar Structures by a Single Motif

The Strucmotif class provides a fluent API to process structural motif queries.

Strucmotif.searchForStructures()
        // several ways can be used to define the query motif - e.g., specify a PDB entry id
        .defineByPdbIdAndSelection("4cha",
                // and a collection of sequence positions to extract residues to use as motif
                List.of(new LabelSelection("B", "1", 42), // HIS
                        new LabelSelection("B", "1", 87), // ASP
                        new LabelSelection("C", "1", 47))) // SER
        .rmsdCutoff(1.0)
        .buildParameters()
        .buildContext()
        .run()
        .getHits()
        .stream()
        .map(hit -> hit.structureIdentifier() + "_" + 
                    hit.assemblyIdentifier() + " @ " + 
                    hit.labelSelections() + " - RMSD: " +
                    hit.rmsd())
        .forEach(System.out::println);

Detect if a Structure Contains Motifs of Interest

This process can also be reversed to detect whether a structure of unknown function contains characteristic motifs.

// acquire a collection of motifs to screen for
Set<EnrichedMotifDefinition> motifs = Strucmotif.getMotifDefinitionRegistry().getEnrichedMotifDefinitions();

Strucmotif.detectMotifs()
        .defineByPdbIdAndAssemblyId("2mnr", "1")
        .withMotifs(motifs)
        .rmsdCutoff(1.0)
        .buildParameters()
        .buildContext()
        .run()
        .getHits()
        .stream()
        .map(hit -> hit.motifIdentifier() + " @ " +
                    hit.labelSelections() + " - RMSD: " +
                    hit.rmsd())
        .forEach(System.out::println);

Configuration

Property	Action	Default Value/Behavior
`ccd-url`	URL to the chemical component dictionary	wwPDB
`decimal-places-score`	Number of decimal places reported for scores	`2`
`decimal-places-matrix`	Number of decimal places reported in transformation matrices	`3`
`in-memory-strategy`	Preload structure data for increased performance?	`off`
`loading-chunk-size`	Batch size when holding structure data in memory	`200,000`
`max-results`	Maximum number of results that will be returned	`50,000`
`max-motif-size`	Maximum number of residues that may define a motif	`10`
`per-query-threads`	Number of worker threads per query	available processors
`query-timeout`	Interrupt queries after `n` milliseconds	`none`
`root-path`	Path where data files will be written	`/opt/data/`

Configure by placing your application.properties on the classpath. All properties specific to this project must be prefixed with `strucmotif.`, e.g., `strucmotif.root-path`.
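For illustration, an application.properties could look as follows. The property names are taken from the table above; the values are arbitrary examples chosen for this sketch, not recommendations.

```properties
# application.properties (values are illustrative assumptions)
strucmotif.root-path=/opt/data/
strucmotif.max-results=1000
strucmotif.max-motif-size=10
strucmotif.decimal-places-score=2
# interrupt queries after 5000 milliseconds
strucmotif.query-timeout=5000
```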

Index Structure Data and Run Updates

You will need to process your corpus of structure data before using the service. This will create an optimized version of all structure files and add them to an inverted index that allows efficient searching.

Details can be found in: UPDATE.md

Implementation Details

Addressing Residues

Two address schemes exist. LabelSelection is a high-level, object-based way of referencing individual residues. It uses a combination of mmCIF properties, namely label_asym_id, struct_oper_id, and label_seq_id:

LabelSelection ref = new LabelSelection("A", "1", 123);

Internally, residues are addressed by primitive 32-bit unsigned integers. These values do not follow any particular layout; rather, all encountered residues are addressed by their index. Chain boundaries are ignored. Operations required for assemblies are honored as they occur in the source file and merely increment the counter. Additional work is done to preserve information on chains and assemblies. Chain and operator names as well as boundaries are stored in memory and can be used to reconstruct LabelSelection instances if needed.
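To make the index-based addressing concrete, the following sketch shows one way a flat residue index could be mapped back to its chain when the chain start offsets are kept in memory. The class `ChainBoundaries`, its fields, and the use of binary search are assumptions made for illustration; the library's actual bookkeeping may differ.

```java
import java.util.Arrays;

// Hypothetical bookkeeping: residues are numbered consecutively across chains,
// and the first residue index of each chain is remembered.
final class ChainBoundaries {
    private final String[] chainNames;
    private final int[] chainStarts; // ascending, chainStarts[0] == 0

    ChainBoundaries(String[] chainNames, int[] chainStarts) {
        this.chainNames = chainNames;
        this.chainStarts = chainStarts;
    }

    // Find the name of the chain that contains the given flat residue index.
    String chainOf(int residueIndex) {
        if (residueIndex < 0) {
            throw new IllegalArgumentException("residue index must not be negative");
        }
        int pos = Arrays.binarySearch(chainStarts, residueIndex);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent;
        // the containing chain is the one that starts just before the insertion point
        int chain = pos >= 0 ? pos : -pos - 2;
        return chainNames[chain];
    }
}
```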

Residue pairs are identified by pairs of these int values. A pair can be stored as a single long value by chaining together the first and second value.
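Packing two int indices into one long can be sketched as follows. The bit order (first index in the upper 32 bits, second index in the lower 32 bits) and all names are assumptions made for this example; the text only states that the two values are chained together.

```java
// Hypothetical helper that packs two residue indices into a single long.
final class ResiduePairKey {
    // The first index goes into the upper 32 bits, the second into the lower 32 bits.
    static long pack(int first, int second) {
        // mask the second value so that sign extension cannot overwrite the upper bits
        return ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    // Recover the first index from the upper 32 bits.
    static int first(long packed) {
        return (int) (packed >>> 32);
    }

    // Recover the second index from the lower 32 bits.
    static int second(long packed) {
        return (int) packed;
    }
}
```

A single long key of this kind can be held in primitive arrays or sorted directly, which avoids allocating an object per residue pair.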

Residue Pair Descriptor

Residue pair descriptors capture the label_comp_id of both interacting residues, their backbone distance, their side-chain distance, and the angle defined between the two residues.

Each descriptor is an element of the Cartesian product of ResidueType (A, 36 states, 6 bits) x ResidueType (B, 36 states, 6 bits) x DistanceType (C, 32 states, 5 bits) x DistanceType (D, 32 states, 5 bits) x AngleType (E, 10 states, 4 bits) and is stored in an unsigned 32-bit integer. The 4th most significant bit of a 32-bit descriptor stores metadata (M) that tracks whether the identifier is flipped. In the layout below, X marks bits that none of these fields occupy.

XXXMAAAA AABBBBBB XXCCCCCD DDDDEEEE

A second flavor exists that only tracks DistanceType x DistanceType x AngleType and can be held in an unsigned 16-bit short.

XXCCCCCD DDDDEEEE

Convenience functions to work with these descriptors are provided in the ResiduePairDescriptor class.
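The bit layouts shown above can be reproduced with plain shifts and masks. The sketch below is not the ResiduePairDescriptor class: the class name `DescriptorBits` and its methods are hypothetical, and the int arguments stand for the ordinals of the ResidueType, DistanceType, and AngleType states. The shift amounts are read off the layout strings.

```java
// Hypothetical encoder/decoder for the layout XXXMAAAA AABBBBBB XXCCCCCD DDDDEEEE.
final class DescriptorBits {
    // Bit offsets, counted from the least significant bit:
    // E: 0-3, D: 4-8, C: 9-13, X: 14-15, B: 16-21, A: 22-27, M: 28, X: 29-31
    static int encode(int residueA, int residueB, int distC, int distD, int angleE, boolean flipped) {
        return ((flipped ? 1 : 0) << 28)
                | ((residueA & 0x3F) << 22)
                | ((residueB & 0x3F) << 16)
                | ((distC & 0x1F) << 9)
                | ((distD & 0x1F) << 4)
                | (angleE & 0x0F);
    }

    static boolean isFlipped(int descriptor) { return ((descriptor >>> 28) & 1) == 1; }
    static int residueA(int descriptor) { return (descriptor >>> 22) & 0x3F; }
    static int residueB(int descriptor) { return (descriptor >>> 16) & 0x3F; }
    static int distC(int descriptor) { return (descriptor >>> 9) & 0x1F; }
    static int distD(int descriptor) { return (descriptor >>> 4) & 0x1F; }
    static int angleE(int descriptor) { return descriptor & 0x0F; }

    // 16-bit flavor (XXCCCCCD DDDDEEEE): keep only the C, D, and E fields.
    static short geometryOnly(int descriptor) {
        return (short) (descriptor & 0x3FFF);
    }
}
```

Because each field occupies a fixed bit range, a descriptor can be compared, hashed, and used as an index key as a single primitive value.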

Related Projects

ciftools-java: mmCIF parsing and BinaryCIF implementation
ffindex-java: bundle large amounts of small files together
rcsb-molstar: define motifs in 3D and visualize results

Publication

Bittrich S, Burley SK, Rose AS (2020) Real-time structural motif searching in proteins using an inverted index strategy. PLoS Comput Biol 16(12): e1008502. https://doi.org/10.1371/journal.pcbi.1008502
