A G-quadruplex (G4) is a four-stranded structure built from guanine tetrads, each of which is a coplanar square array of four guanines. It forms when guanine tetrads stack upon one another, which requires four stretches of three or more guanines (Figure 3 & Table 1). Findings from biophysical experiments on G4 formation led to the development of a number of different prediction algorithms.
A G4 sequence is formed by four stretches of repeated Gs (G-runs) and three regions of nucleotide sequence that interconnect the G-runs (loops). G4 motif identification is based on a general formula that recognizes the following pattern: G+N?G+N?G+N?G+. In this formula, ‘N’ specifies an arbitrary base that may or may not be G, ‘+’ means at least one repeat of the preceding symbol, and ‘?’ indicates zero or more repeats. Initially, G4 motif hunting was done primarily with regular-expression-based pattern-matching approaches (Cao et al., 2012; Huppert & Balasubramanian, 2005, 2007; Rawal et al., 2006; Todd et al., 2005). A list of potential G4s was generated by taking all combinations of loop length (1–7 nt) and guanine run length (3–5 nt), and each entry was then searched against the total genomic sequence (Todd et al., 2005) (Table 2). The Quadparser tool (Huppert & Balasubramanian, 2005, 2007) was built on a method that scans the query sequence for four or more runs of guanines (Table 2).
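The pattern-matching idea can be illustrated with a short regular-expression scan. This is a minimal sketch that uses the G-run (3–5 nt) and loop (1–7 nt) ranges quoted above; the function name `find_g4_candidates` and the use of a lookahead to report overlapping hits are assumptions made for illustration, and the code is not taken from Quadparser or any other published tool.

```python
import re

# Four G-runs of 3-5 nt separated by three loops of 1-7 nt.
# Loops may contain any base, including G.
# The lookahead (?=...) lets the scan report one candidate per start position,
# so overlapping candidates are not skipped.
G4_PATTERN = re.compile(
    r"(?=(G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}[ACGT]{1,7}G{3,5}))"
)


def find_g4_candidates(seq):
    """Return (start, end, motif) for each candidate G4 motif in seq.

    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for match in G4_PATTERN.finditer(seq):
        motif = match.group(1)
        hits.append((match.start(), match.start() + len(motif), motif))
    return hits


if __name__ == "__main__":
    for hit in find_g4_candidates("TTGGGAGGGAGGGAGGGTT"):
        print(hit)
```

A scan of this kind only covers the strand it is given; to search the complementary strand, the reverse complement of the sequence would have to be scanned as well.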
In a similar approach, potential G4s were predicted by scanning the query sequence for four or more G-runs and then calculating the frequency of each quadruplex in G-tracts (regions of multiple quadruplexes) (Rawal et al., 2006) (Table 2). In G4 calculator, a sliding window is moved along the sequence, and the proportion of windows that contain four GGG runs is reported as the G4P of the region (Table 2; Eddy & Maizels, 2006). Web-based tools for G-quadruplex prediction are also available: QGRS Mapper (http://bioinformatics.ramapo.edu/QGRS/) (Kikin et al., 2006) and GRSDB (http://bioinformatics.ramapo.edu/grsdb/) (Kostadinov et al., 2006) (Table 2).
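A windowed G4P calculation of the kind described for G4 calculator might look like the sketch below. The window length (100 nt), the step (20 nt), the "at least four" threshold and the non-overlapping way of counting GGG runs are all assumptions chosen for illustration, and the function name `g4_potential` is hypothetical; this is not the G4 calculator code.

```python
import re


def g4_potential(seq, window=100, step=20, min_runs=4):
    """Fraction of sliding windows that contain at least min_runs GGG runs.

    GGG runs are counted without overlap, so GGGGGG counts as two runs
    (an assumption of this sketch).
    """
    seq = seq.upper()
    if len(seq) < window:
        return 0.0
    starts = range(0, len(seq) - window + 1, step)
    positive = 0
    for start in starts:
        runs = re.findall("GGG", seq[start:start + window])
        if len(runs) >= min_runs:
            positive += 1
    return positive / len(starts)


if __name__ == "__main__":
    example = "GGGA" * 4 + "T" * 24
    print(g4_potential(example, window=20, step=10))
```

Smaller windows and steps make the measure more local, whereas larger windows smooth it over longer regions.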
To analyze G4 sequences with these web-based tools, users need to upload the sequence or retrieve sequences through the NCBI Gene Entrez RefSeq database, and then search their target sequences for G4s. Although pattern-matching methods can quickly identify large numbers of G4 motifs in genomic sequences, most of these hits are false positives (Beaudoin & Perreault, 2010; Lorenz et al., 2012). To reduce false-positive predictions, a probabilistic model based on a hidden Markov model (HMM) was proposed (Yano & Kato, 2014) (Table 2). Four HMM-based models were devised by determining the number of hidden states that represent the G-runs and loops of G4 motifs. To train the G4 HMMs, experimentally verified data from the literature were used (Stegle et al., 2009).
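To make the idea of hidden states for G-runs and loops concrete, the toy example below decodes a sequence with the Viterbi algorithm. It is a deliberately simplified sketch with only two states, and every probability in it is an arbitrary illustrative value rather than a trained parameter; it does not reproduce the architecture or training of the published models.

```python
import math

# Hypothetical two-state HMM: all values below are illustrative assumptions.
STATES = ("G_run", "loop")
START = {"G_run": 0.5, "loop": 0.5}
TRANS = {
    "G_run": {"G_run": 0.7, "loop": 0.3},
    "loop": {"G_run": 0.3, "loop": 0.7},
}
EMIT = {
    "G_run": {"G": 0.94, "A": 0.02, "C": 0.02, "T": 0.02},
    "loop": {"G": 0.25, "A": 0.25, "C": 0.25, "T": 0.25},
}


def viterbi(seq):
    """Return the most probable state path for seq (A/C/G/T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("GGGTTTGGG"))
```

In a trained model the transition and emission probabilities would be estimated from experimentally verified G4 sequences instead of being set by hand.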
The existence of imperfect G4s has been confirmed since 2013 not only by in vitro experiments (Mukundan & Phan, 2013) but also by in silico simulations and molecular dynamics (Varizhuk et al., 2017). New algorithms that identify imperfect G4s were subsequently established, such as TetraplexFinder/QuadBase2 (Dhapola & Chowdhury, 2016), G4Hunter (Bedrat et al., 2016) and ImGQfinder (Varizhuk et al., 2014) (Table 2). For example, TetraplexFinder scans runs of three guanines for bulges of user-defined length.
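The notion of a bulged G-run can be sketched with a small pattern builder. The example below assumes that a bulge is a stretch of non-G bases interrupting a three-guanine run at a single position; the function names `bulged_g_run_regex` and `find_g_runs` are hypothetical, and the code illustrates the concept only, not the implementation of TetraplexFinder.

```python
import re


def bulged_g_run_regex(max_bulge):
    """Regex for a three-G run that is intact or split once by a non-G bulge.

    max_bulge is the user-defined maximum bulge length in nucleotides.
    """
    bulge = rf"[ACT]{{1,{max_bulge}}}"
    return re.compile(rf"GGG|G{bulge}GG|GG{bulge}G")


def find_g_runs(seq, max_bulge=1):
    """Return (start, run) for leftmost, non-overlapping G-runs in seq."""
    pattern = bulged_g_run_regex(max_bulge)
    return [(m.start(), m.group()) for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_g_runs("AAGGGAA"))                  # intact run
    print(find_g_runs("AAGTGGAA", max_bulge=1))    # run with a 1-nt bulge
    print(find_g_runs("AAGTTGGAA", max_bulge=2))   # run with a 2-nt bulge
```

Because matches are reported left to right without overlap, a bulged run that starts earlier takes precedence over an intact run that begins inside it.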
ImGQfinder searches for a single bulge or mismatch in a guanine run of flexible length. G4Hunter does not specify the defect type but can accommodate different types of defects by using encoding and statistics over a sliding window. In 2017, a new algorithm, pqsfinder, was put forward (Hon et al., 2017). The algorithm not only predicts imperfect G4s but also calculates a score as a measure of stability, takes into account the presence of extensively long loops, and reports the density of G4s at each position, while allowing users to define criteria for matching and scoring by overriding the default values (Table 2). This algorithm can be divided into three logical steps: (i) identification of all possible perfect/imperfect G-run quartets, (ii) score assignment and (iii) overlap resolution (Hon et al.