Non-coding RNAs have generated great interest in recent years. Many of their functions are so important that they have already changed our understanding of living organisms. The functions of non-coding RNAs depend mainly on their structures, so extensive efforts have been made to analyse these structures. Within an RNA structure, many ‘subcomponents’ are recurrent; these are referred to as ‘RNA structural motifs’. These motifs usually share conserved structural and functional characteristics, and can act as a starting point for the investigation of the whole RNA molecule. Here, we survey a number of computational tools for the automatic identification of RNA structural motifs from three-dimensional RNA structures. We anticipate that the information will help readers choose the best tools for their own applications.
by Shaojie Zhang and Cuncong Zhong
RNA structural motifs, building blocks of the RNA architecture
The word ‘motif’ usually refers to recurrent fragments or patterns. For example, the term protein motif refers to a recurrent pattern in the three-dimensional (3D) arrangement of amino acids. Similarly, an RNA structural motif is a recurrent pattern in the arrangement of nucleotides. The recurrence of these patterns indicates evolutionary conservation as well as functional importance. While the 3D structure of RNA is usually very complex and difficult to analyse, RNA structural motifs are highly modularised components, whose abundance and arrangements can be used to characterise the functionalities of the RNA. Computational identification of RNA structural motifs is therefore a critical step towards RNA structural analysis.
An RNA is a chain of nucleotides, and each nucleotide contains a nitrogenous base, a ribose sugar and a phosphate. The phosphate group connects two consecutive nucleotides by linking their ribose sugars through phosphodiester bonds. One of the major contributions to the formation of the diversified RNA structures comes from the hydrogen bonds between bases; two bases that interact in this way are called a base-pair. Each base has three edges, namely the Watson-Crick edge, the Hoogsteen edge and the Sugar edge, and each of the three edges can be used to form base-pairs. In DNA, interactions between Watson-Crick edges dominate and result in the well-known double-helix structure; the base-pairs that form regular double-helix structures are called canonical base-pairs. In RNA, interactions are more versatile and can exist between any edges of the bases, giving both canonical and non-canonical base-pairs. Leontis et al. studied non-canonical base-pairs and developed the associated isostericity matrices for substitutions between base-pairs [1]. They showed that base-pairs within the same isosteric group can be interchanged without altering the overall RNA structure. Another force that maintains the RNA structure is base stacking. Each base is a planar aromatic ring, and the delocalised electrons above and below the plane give rise to attractive interactions between neighbouring bases that are stacked face to face. The direction of stacking of the base-pairs can thus affect the stability of the local structure.
In this review, we discuss a number of recently developed RNA structural motif identification tools. We have organised these tools into two categories, namely the geometry-based scheme and the base-pairing pattern-based scheme [Figure 1]. We introduce these tools by summarising the strategies for modeling, representing and manipulating the RNA structural motifs. The tools available on-line are summarised in Table 1.
Searching RNA structural motifs through geometric properties
The most intuitive way to search for a given RNA structural motif is through its geometric properties. Ideally, given 3D information of an RNA structural motif as the query, we want to find the substructures in the target RNA structure that are geometrically identical to the query. However, a practical issue that has arisen is how to extract the 3D information of these structural motifs. Some tools focus on the interacting bases while others mainly consider the backbone conformation. Implementation is discussed in the following sections.
NASSAM (Nucleic Acids Search for Substructures And Motifs) [2]. NASSAM uses the pairwise distances between the key atoms within an RNA structural motif to model the base-pairing interactions. The key atoms are fixed for each type of base; for instance, purines are represented by the N9, N7 atoms and C6, C2 functional groups. The intra-base key atom distances are used to uniquely represent the base, while the inter-base key atom distances are used to model the geometry of the base-pairing. The pairwise distances are summarised in a matrix, and Ullmann’s subgraph isomorphism algorithm is applied to identify related motifs. NASSAM was the very first tool devoted to RNA structural motif identification; it provides a general scheme for the geometry-based applications that followed. However, the major focus of NASSAM is the study of specific base-pair properties (for example, A-platforms and uncommon U-U interactions). Improvements are still required before it can be applied to larger RNA structural motifs.
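The distance-matrix idea can be illustrated with a minimal Python sketch. It is not NASSAM's implementation: the function names, the tolerance value and the toy coordinates are assumptions made for illustration, and an exhaustive search over index tuples stands in for Ullmann's subgraph isomorphism algorithm, which prunes the same search space far more efficiently.

```python
import math
from itertools import permutations


def distance_matrix(points):
    """All pairwise Euclidean distances between 3D points (e.g. key atoms)."""
    return [[math.dist(p, q) for q in points] for p in points]


def find_matches(query, target, tol=0.5):
    """Return tuples of target indices whose pairwise distances reproduce
    the query's pairwise distances to within tol (hypothetical tolerance).

    Brute force over ordered index tuples; a stand-in for Ullmann's algorithm.
    """
    dq = distance_matrix(query)
    dt = distance_matrix(target)
    n = len(query)
    hits = []
    for idx in permutations(range(len(target)), n):
        if all(abs(dq[i][j] - dt[idx[i]][idx[j]]) <= tol
               for i in range(n) for j in range(i + 1, n)):
            hits.append(idx)
    return hits


# Toy coordinates, not taken from any real structure.
query = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
target = [(10, 10, 10), (1, 1, 1), (4, 1, 1), (1, 5, 1), (20, 0, 0)]
print(find_matches(query, target))  # [(1, 2, 3)]
```

Because only distances are compared, a match is unaffected by rotation and translation of the target, but a mirror image of the query would also be reported.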
PRIMOS (Probing RNA structures to Identify Motifs and Overall Structural changes) [3]. The backbone of an RNA structure is the sugar chain joined by phosphodiester bonds. A backbone-only representation can be very concise, since the complicated base-base interactions are excluded. PRIMOS considers the torsion angles at the linkages of the backbone, so the backbone structure of an RNA chain is summarised by a series of torsion angles. The trajectory generated by following this series is taken as the signature of the motif and is used to search against the candidates. The idea of PRIMOS has been extended to de novo motif discovery and implemented as COMPADRES (Comparative Algorithm to Discover Recurring Elements of Structure) [4]. The entire RNA structure is represented using the torsion angles, and regions within the molecule are compared with each other. The recurrent regions resulting from these self-comparisons are extracted as hypothesised motifs.
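A torsion-angle signature of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PRIMOS code: which four backbone atoms define each torsion is left to the caller, and the scoring (mean wrapped angular difference over a sliding window) and all function names are hypothetical choices.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion(p0, p1, p2, p3):
    """Dihedral angle in degrees, in [-180, 180], for four consecutive points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)


def scan(query, target):
    """Slide a query torsion series along a target series.

    Returns (mean angular difference, start position) pairs, best first.
    """
    n = len(query)
    scores = []
    for start in range(len(target) - n + 1):
        window = target[start:start + n]
        diffs = [angle_diff(q, t) for q, t in zip(query, window)]
        scores.append((sum(diffs) / n, start))
    return sorted(scores)
```

Wrapping the difference matters because torsion angles are periodic: 170 degrees and -170 degrees differ by 20 degrees, not 340.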
ARTS (Alignment of RNA Tertiary Structures) [5]. ARTS uses the 3D information from the phosphorus atoms to represent the backbone conformation. ARTS takes a base-pair as a unit, and the helical backbone can be viewed as a continuous stack of base-pairs. ARTS first identifies a number of well-matched pairs of base-pairs and determines the transformation between the two coordinate systems that produces each match. These matched base-pairs are taken as the seeds for extension. Adjacent base-pairs are attached to the seed if they are spatially close to each other. In addition to the motif search, ARTS is able to align any pair of RNA tertiary structures. Another advantage worth noting is that ARTS can approximately handle nucleotide insertions and deletions: during the extension stage, the newly attached base-pairs do not need to be directly linked to the seed.
FR3D (Find RNA 3D) [6]. FR3D takes into account both the geometric information from pairing bases and their interacting patterns. FR3D represents each base by its geometric centre, and a motif by all pairwise distances between the geometric centres of its bases. FR3D then tries to fit the query motif to the candidate and records the pairwise distance discrepancy. After fitting the geometric centres, the minimum rotation angle that best superimposes the two bases is computed for each pair of matched bases. The overall closeness of fit is measured by both the fitting error and the orientation error. This fit is used to prioritise the candidates, and the best among them are selected. FR3D also supports symbolic search, meaning that the user can add certain constraints to the base interacting pattern (for example, the 4th and 6th bases are stacked inwards or they are paired through their Hoogsteen edges). FR3D can be used to study composite motifs that involve more than two strands. It is therefore likely that many new occurrences that are embedded in complicated regions inside an RNA structure will be discovered. FR3D also provides a friendly user interface, and new features are still being added to this tool, such as a base-phosphate interactions feature.
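The distance part of this scoring can be sketched as follows. The sketch is not FR3D's code: it covers only the discrepancy between pairwise centre-to-centre distances, omits the orientation error entirely, and uses a root-mean-square summary that is an assumption made here for illustration. Each base is given as a list of atom coordinates, and the function names are hypothetical.

```python
import math
from itertools import combinations


def centre(atoms):
    """Geometric centre of one base, given as a list of (x, y, z) atoms."""
    n = len(atoms)
    return tuple(sum(a[k] for a in atoms) / n for k in range(3))


def distance_discrepancy(query_bases, candidate_bases):
    """Root-mean-square difference between corresponding pairwise
    centre-to-centre distances (assumed summary; orientation is ignored)."""
    if len(query_bases) != len(candidate_bases) or len(query_bases) < 2:
        raise ValueError("need two motifs with the same number (>= 2) of bases")
    qc = [centre(b) for b in query_bases]
    cc = [centre(b) for b in candidate_bases]
    squared = [(math.dist(qc[i], qc[j]) - math.dist(cc[i], cc[j])) ** 2
               for i, j in combinations(range(len(qc)), 2)]
    return math.sqrt(sum(squared) / len(squared))


def rank_candidates(query_bases, candidates):
    """Order candidate names from best (lowest discrepancy) to worst."""
    return sorted(candidates,
                  key=lambda name: distance_discrepancy(query_bases,
                                                        candidates[name]))
```

A candidate that is merely a translated copy of the query scores zero, which is the behaviour one wants from a prefilter before the more expensive orientation check.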
The shape histogram method [7]. The shape histogram modeling considers both a reduced backbone conformation and the geometry of pairing bases. A shape histogram of an RNA structural motif is constructed using the following steps. First, the geometric centroid of the entire motif is computed. Then the Euclidean distance between the centroid and each atom within the motif is computed. The Euclidean distances are represented by the shape histogram, where the x-axis represents the distance, and the y-axis represents the number of backbone atoms that lie at that distance from the centroid. The shape histogram is considered as the ‘fingerprint’ of the motif and can be used to search against the candidates for motif occurrences. Although it remains unclear why the shape histogram is advantageous, the simple and robust representation of the RNA structural motif makes it convenient for applications involving complex motifs, such as RNA junction regions. The prototype software for the shape histogram method is available by contacting the corresponding author.
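The construction steps above can be written down directly. The following Python sketch is an illustration rather than the authors' prototype: the bin width, the number of bins, the choice to pool everything beyond the last bin, and the use of an L1 distance to compare two histograms are all assumptions.

```python
import math


def shape_histogram(atoms, bin_width=1.0, n_bins=10):
    """Histogram of atom-to-centroid distances for one motif.

    atoms is a list of (x, y, z) coordinates; distances beyond the last
    bin are pooled into it (an assumption of this sketch).
    """
    n = len(atoms)
    centroid = tuple(sum(a[k] for a in atoms) / n for k in range(3))
    hist = [0] * n_bins
    for atom in atoms:
        b = min(int(math.dist(atom, centroid) / bin_width), n_bins - 1)
        hist[b] += 1
    return hist


def histogram_distance(h1, h2):
    """L1 distance between two histograms (hypothetical comparison measure)."""
    return sum(abs(x - y) for x, y in zip(h1, h2))


# Toy coordinates, not taken from any real structure.
motif = [(1.5, 0, 0), (-1.5, 0, 0), (0, 3.5, 0), (0, -3.5, 0)]
print(shape_histogram(motif, n_bins=5))  # [0, 2, 0, 2, 0]
```

The fingerprint depends only on distances to the centroid, so it does not change when the motif is rotated or translated; the price is that distinct shapes can share the same histogram.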
Searching RNA structural motifs through base-pairing patterns
The geometry-based methods can only identify motifs with relatively rigid topologies, because the geometry of a motif can be significantly altered by even a single base-pair insertion or deletion. In addition, these methods usually perform the search with sophisticated heuristic modeling rather than rigorously defined algorithms, so their performance is usually assessed empirically, without theoretical guarantees. Base interaction patterns, especially the base-pairing patterns of the RNA structural motifs, have proved useful for modeling purposes. The definition of base-pair isostericity implies that the base-pairing patterns can largely determine the geometry. It is thus possible to consider solely the base-pairing patterns for the motif search. In the following examples, we discuss how the base-pairing patterns are used to model and search for RNA structural motifs.
Graph modeling of RNA structural motifs [8]. The entire base-interacting pattern of an RNA structural motif can be represented as a graph in which the nucleotides are the nodes and the pairing and stacking interactions between them are the edges. Each interaction is labeled with a property, i.e., the pairing edges or the stacking orientation. This modeling has been successfully applied to the sarcin-ricin motif. Although the idea is simple, it is a proof of concept that the base-interacting pattern can determine 3D structure. The idea has not been implemented in a general RNA structural motif search tool. Interestingly, it has been applied in MC-Fold to predict RNA 3D structures from sequences.
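A labelled interaction graph of this kind is easy to hold in memory. The Python sketch below is only a data-structure illustration, not the graph-grammar method of [8]: the class name, the label strings (for example "tHS" for a trans Hoogsteen/Sugar-edge pair, or "upward" for a stacking orientation) and the four-nucleotide example are all hypothetical.

```python
from collections import defaultdict


class MotifGraph:
    """Nucleotides as nodes; pairing and stacking interactions as labelled edges."""

    def __init__(self, sequence):
        self.sequence = sequence
        # edges[i][j] = (kind, label); labels are read from the lower
        # to the higher nucleotide index (a convention of this sketch).
        self.edges = defaultdict(dict)

    def add_interaction(self, i, j, kind, label):
        if kind not in ("pair", "stack"):
            raise ValueError("kind must be 'pair' or 'stack'")
        self.edges[i][j] = (kind, label)
        self.edges[j][i] = (kind, label)

    def interactions(self, kind=None):
        """List (i, j, kind, label) with i < j, optionally filtered by kind."""
        found = []
        for i in sorted(self.edges):
            for j in sorted(self.edges[i]):
                if i < j and (kind is None or self.edges[i][j][0] == kind):
                    found.append((i, j) + self.edges[i][j])
        return found

    def pairing_degree(self, i):
        """Number of base-pairs a nucleotide takes part in (2 or more = multi-pair)."""
        return sum(1 for k, _ in self.edges.get(i, {}).values() if k == "pair")


# Toy motif; the interactions are invented for illustration.
g = MotifGraph("GAAC")
g.add_interaction(0, 3, "pair", "cWW")
g.add_interaction(1, 3, "pair", "tHS")
g.add_interaction(1, 2, "pair", "tSH")
g.add_interaction(0, 1, "stack", "upward")
```

Keeping pairing and stacking edges in one structure means that a later search can choose to use both, or to drop the stacking edges and work with the pairing pattern alone.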
Automatic RNA structural motif clustering [9]. A similar idea has also been applied to RNA structural motif clustering. The method conducts an all-against-all comparison of candidate motifs and uses hierarchical clustering to identify families of closely related motifs. In this application, the comparison between base-pairing patterns is ‘local’, meaning that the most similar subcomponents between the two graphs are found.
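The clustering step can be sketched independently of how the motifs are compared. The Python sketch below assumes that the all-against-all comparison has already produced a symmetric dissimilarity matrix; the choice of single linkage, the stopping threshold and the function name are assumptions, since the linkage used in [9] is not described here.

```python
def single_linkage(dist, threshold):
    """Agglomerative clustering on a precomputed dissimilarity matrix.

    Repeatedly merges the two closest clusters (single linkage) until the
    closest pair is farther apart than threshold. Returns sorted clusters
    of item indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Toy dissimilarities between four candidate motifs (invented values).
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(single_linkage(dist, threshold=0.5))  # [[0, 1], [2, 3]]
```

Each resulting cluster plays the role of a putative motif family; lowering the threshold splits families, raising it merges them.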
RNAMotifScan [10]. RNAMotifScan was developed to facilitate the search of RNA structural motifs using base-pairing patterns. RNAMotifScan removes the stacking interactions so that the graph modeling of an RNA structural motif can be reduced to a tree representation. A dynamic programming-based algorithm is then used to find the optimal matching between the query and candidate motifs, with consideration of insertions and deletions of nucleotides. Two important issues are considered under the current base-pairing modeling scheme. Firstly, the base-pair similarities need to be quantified in order to reflect the corresponding geometric changes. Secondly, multi-pairs (such as base-triples) need to be considered because of their abundance in RNA structural motifs. Isostericity is incorporated into the alignment through the substitution scoring function for base-pairs. Intuitively, isosteric base-pairs preserve 3D geometric information, so a bonus is given for matching them. In order to consider multi-pairs, all base-pairs are classified into different relationship groups such that only base-pairs within the same group can match. The restriction guarantees that a multi-paired base can only be matched with another multi-paired base with the same pairing configuration. RNAMotifScan is the first RNA structural motif search tool that only utilises base-pairing information. It demonstrates improved performance over most existing methods, and at the same time provides higher efficiency because the information searched is reduced. It is capable of searching a motif against the entire Protein Data Bank within two hours. The results provide valuable information for estimating the distributions of RNA structural motifs among different kinds of structured RNAs.
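The role of the isostericity bonus and of the gap penalty can be seen in a deliberately simplified Python sketch. It is not RNAMotifScan's algorithm: instead of aligning trees, it runs a standard global (Needleman-Wunsch style) alignment over an ordered list of base-pair labels, it ignores relationship groups for multi-pairs, and the label format, the score values and the small isostericity table are assumptions made for illustration.

```python
def align_pairs(query, candidate, iso_group,
                match=2, iso=1, mismatch=-1, gap=-1):
    """Global alignment score between two ordered lists of base-pair labels.

    iso_group maps a label to the name of its isosteric group; two
    different labels in the same group earn the smaller 'iso' bonus.
    All score values are hypothetical.
    """
    def score(a, b):
        if a == b:
            return match
        if iso_group.get(a) is not None and iso_group.get(a) == iso_group.get(b):
            return iso
        return mismatch

    n, m = len(query), len(candidate)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(query[i - 1], candidate[j - 1]),
                dp[i - 1][j] + gap,      # base-pair missing from the candidate
                dp[i][j - 1] + gap,      # extra base-pair in the candidate
            )
    return dp[n][m]


# Toy data: labels combine a pairing family with the two bases.
toy_iso = {"cWW:GC": "I1", "cWW:CG": "I1"}
query = ["cWW:GC", "tHS:AG", "tSH:GA"]
with_insertion = ["cWW:GC", "cWW:AU", "tHS:AG", "tSH:GA"]
print(align_pairs(query, with_insertion, toy_iso))  # 5
```

The inserted base-pair costs a single gap penalty rather than destroying the match, which is the tolerance to insertions and deletions that rigid geometric comparison lacks.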
Summary
In general, the geometry-based RNA motif search tools are more specific, and the base-pairing pattern-based RNA motif search tools are more sensitive. Computationally, geometry-based tools usually use heuristics, while base-pairing pattern-based tools can apply rigorous algorithms for the search. In terms of speed, the base-pairing pattern-based tools usually run faster because of their simplified modeling.
It is difficult to conclude which method is preferable for RNA structural motif identification, since most of these computational tools cater for different RNA motif families. Users should first consider how their motif of interest is defined before choosing which tool to use. For instance, some motifs are defined by their most distinct geometric property (for example, the kink-turn is characterised by its ‘kink’ and the turn between two helical regions, see Figure 1), while others are defined by their base-pairing patterns (for example, the tandem sheared motif is defined by three consecutive sheared base-pairs). The size of the query motif should also be considered. When the motif of interest is fairly small (empirically, fewer than 6 nucleotides), the user may want to choose the more specific geometry-based tools to refine the results. On the other hand, if the query motif is large (more than 6 nucleotides), base-pairing pattern-based tools should also be tried to find more interesting hits.
References
1. Leontis NB, Stombaugh J and Westhof E. The non-Watson-Crick base-pairs and their associated isostericity matrices. Nucleic Acids Res 2002; 30(16): 3497-3531.
2. Harrison AM et al. Representation, searching and discovery of patterns of bases in complex RNA structures. J Comput Aided Mol Des 2003; 17(8): 537-549.
3. Duarte CM, Wadley LM and Pyle AM. RNA structure comparison, motif search and discovery using a reduced representation of RNA conformational space. Nucleic Acids Res 2003; 31(16): 4755-4761.
4. Wadley LM and Pyle AM. The identification of novel RNA structural motifs using COMPADRES: an automated approach to structural discovery. Nucleic Acids Res 2004; 32(22): 6650-6659.
5. Dror O, Nussinov R and Wolfson H. ARTS: alignment of RNA tertiary structures. Bioinformatics 2005; 21 Suppl 2: ii47-ii53.
6. Sarver M et al. FR3D: finding local and composite recurrent structural motifs in RNA 3D structures. J Math Biol 2008; 56(1-2): 215-252.
7. Apostolico A et al. Finding 3D motifs in ribosomal RNA structures. Nucleic Acids Res 2009; 37(4): e29.
8. St-Onge K et al. Modeling RNA tertiary structure motifs by graph-grammars. Nucleic Acids Res 2007; 35(5): 1726-1736.
9. Djelloul M and Denise A. Automated motif extraction and classification in RNA tertiary structures. RNA 2008; 14(12): 2489-2497.
10. Zhong C, Tang H and Zhang S. RNAMotifScan: automatic identification of RNA structural motifs using secondary structural alignment. Nucleic Acids Res 2010; 38: e176.
The authors
Shaojie Zhang¹ and Cuncong Zhong²
Department of Electrical Engineering and Computer Science
University of Central Florida
Orlando, FL 32816-2362
USA
¹ Corresponding author
Tel: +1-407-823-6095
e-mail: shzhang@eecs.ucf.edu
² e-mail: cczhong@eecs.ucf.edu