Evolutionary expansion of nematode-specific glycine-rich secreted peptides.

Genome-wide comparisons across 10 species, from the alga Guillardia theta to human, indicated that Caenorhabditis elegans and Caenorhabditis briggsae are highly enriched for glycine-rich secreted peptides (GRSPs), with 110 GRSPs in C. elegans and 93 in C. briggsae. Chromosomal mapping showed that most GRSPs are clustered on the two nematode genomes [103 (93.64%) in C. elegans and 82 (88.17%) in C. briggsae] and can be divided into 18 cluster units in C. elegans and 13 in C. briggsae. Except for four C. elegans GRSPs clusters without matching clusters in C. briggsae, all other GRSPs clusters have paired synteny blocks between the two nematode genomes. Analysis of microarray-based transcriptome datasets indicated extensive genome-wide co-expression of GRSPs clusters after infection of C. elegans. Highly homologous coding sequences and conserved exon-intron structures indicated that tight GRSPs clusters were likely derived from local DNA duplications. Phylogenetic conservation of synteny blocks between the two genomes, co-expression of GRSPs clusters after infection, and strong purifying selection on coding sequences may indicate evolutionary constraints that allow C. elegans to mount rapid systematic responses to infection through co-expression, co-regulation, and co-functionality of GRSPs clusters.


Introduction
According to their primary structure, glycine-rich proteins can be divided into two classes: (1) large glycine-rich proteins (GRPs, >200 AA) that typically function as cell wall structural components and (2) small glycine-rich secreted peptides (GRSPs, <200 AA) that have a typical signal peptide followed by a mature peptide with a high glycine content. GRSPs represent a class of unique effectors of multicellular organisms, possessing relatively simple structures but exhibiting complex biological functions. According to previous research, almost all animals, plants, and microorganisms are enriched with GRPs; examples include glycine-rich cold-induced proteins from zebrafish [1], glycine-rich keratin and keratin-associated proteins from 22 mammalian genomes [2], and RNA-binding proteins with a C-terminal glycine-rich domain from Arabidopsis thaliana [3]. Plant GRPs have diverse functions, including roles in cell wall structure and plant defense, oleosin GRPs in pollen hydration and competition, extracellular ligands of kinase proteins, and RNA-binding GRPs in osmotic and cold stress [4]. Growing evidence suggests that these proteins play key roles in the adaptation of organisms to biotic and abiotic stresses, including those resulting from pathogenesis, alterations in the osmotic, saline, and oxidative environment, and changes in temperature [3].
To our knowledge, the total number of GRSPs encoded by the genomes of different species differs markedly. GRSPs are enriched in some species, whereas in others no GRSPs have been identified. In this study, Caenorhabditis elegans and Caenorhabditis briggsae were found to be highly enriched for GRSPs. The importance of GRSPs in nematodes is highlighted by observations that many members of the GRSP family play important roles in C. elegans innate immunity. For example, nlp-29 and cnc-2 were upregulated after Serratia marcescens infection of C. elegans [5]. nlp-29 and nlp-31 were differentially expressed in response to fungal and bacterial infection [6]. Six members of the GRSP family, nlp-27 to nlp-31 and grsp-2, were upregulated after Drechmeria coniospora infection of C. elegans in vivo [7]. Expression of the family member grsp-21 was upregulated twofold in response to Microbacterium nematophilum [8]. Evolutionary diversification of these GRSPs may enhance the anti-fungal innate immunity of C. elegans [7]. Although these GRSPs are important for C. elegans innate immunity, we could not find corresponding orthologs in the human genome. As soil organisms and bacterial feeders, nematodes are constantly challenged by many species of soil bacteria, fungi, and other microbes, which have been driving the evolution of nematodes. Prompted by published work on members of the GRSP family in the immune responses of C. elegans, we were interested in knowing whether there were more GRSPs in nematodes and how GRSPs responded to infection of C. elegans. We reasoned that free-living soil nematodes are very likely to have developed unique components to adapt to their environment.
The importance of the GRSP family in nematodes is further underscored by the fact that expression of certain C. elegans GRSPs was upregulated by natural infection with Gram-negative bacteria, Gram-positive bacteria, and fungi. Based on these observations, we hypothesized that additional GRSPs exist and that analyzing the genomic sequences would identify novel GRSPs and provide a new global view of GRSP evolution in nematodes. In the present work, we focused on (1) genome-wide identification and classification of GRSPs, to provide a global view of GRSPs evolution in the two nematodes; (2) mapping of these GRSPs onto the two genomes, to provide a global view of GRSPs distribution on the chromosomes; (3) phylogenetic analyses based on the signal peptides of the two nematode GRSPs; and (4) integrated analysis of public transcriptome datasets on C. elegans infections, to gain insights into the role of C. elegans GRSPs in innate immunity.

Identification of GRSPs in the two nematode genomes
A comprehensive comparison of GRSPs was conducted across the genomes of 10 species: Homo sapiens, Danio rerio, Drosophila melanogaster, C. elegans, C. briggsae, A. thaliana, Monosiga brevicollis, Saccharomyces cerevisiae, Dictyostelium discoideum, and G. theta. Genome-wide protein sequences of the 10 species were downloaded from the UCSC database (https://genome.ucsc.edu/) and used to construct two local protein sequence databases. Local BLASTP and PSI-BLAST programs from NCBI were used to identify C. elegans GRSPs, with the previously identified GRSPs nlp-29, nlp-31, nlp-33, cnc-2, cnc-4, and cnc-6 as initial queries. GRSPs of C. briggsae were identified by using all C. elegans GRSPs as initial queries.

C. elegans GRSPs expression at transcriptional level
Gene Expression Omnibus (GEO) data sets in NCBI (http://www.ncbi.nlm.nih.gov/) and the reads of an RNA sequencing project (PRJNA33023) in DRASearch (https://trace.ddbj.nig.ac.jp/DRASearch/) were used to confirm the transcriptional expression of C. elegans GRSPs and to avoid false positives arising from genome annotation. This RNA sequencing project is a component of the C. elegans modENCODE project and includes 308 SRA experiments and 196 BioSamples. The total number of genes on each chromosome of C. elegans was obtained from UCSC (WS220/ce10) to estimate the density of GRSPs on each chromosome.

Mapping GRSPs to the genomes of the two nematodes
Characteristic parameters of GRSPs were obtained from WormBase (https://www.wormbase.org/). Configuration files were generated, and mapping of GRSPs to the genomes was performed with Circos [9]. Spacing was based on chromosomal units, and the results were further modified manually for easier identification. Orthologous pairs were determined by two-way reciprocal "best hits", combining sequence similarity- and synteny-based approaches. Orthologous GRSPs pairs were mapped to their genomes and connected across the chromosomal maps by straight lines to identify conserved orthologous synteny blocks between the two nematode genomes.
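The reciprocal "best hit" step can be illustrated with a minimal Python sketch. It assumes that BLAST hits have already been reduced to (query, subject, bit score) tuples; the function names and the tuple layout are hypothetical and are not taken from the original pipeline, and the synteny-based filtering described above is not included.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed input layout: (query_id, subject_id, bit_score)
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return the highest-scoring subject for each query."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

Ties in bit score are resolved here by keeping the first hit seen; a real analysis would need an explicit tie-breaking rule.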

Transcriptomic analysis of C. elegans GRSPs following infection
Eight transcriptomic data sets related to C. elegans infections and quantified by Affymetrix microarray (GSE20053, E-MEXP-696, GSE27867, GSE54212, GSE53732, GSE41058, GSE37266, and GSE2740) were downloaded from the NCBI GEO database. Differentially expressed GRSPs were identified using the GEO2R tool in the GEO database. The range of a co-expression cluster of C. elegans GRSPs was defined to be less than 500 kb. Because of the limited data sets available for the C. briggsae genome, we could not confirm the transcriptional expression of C. briggsae GRSPs, estimate GRSPs density on its chromosomes, or analyze co-expressed C. briggsae GRSPs after infection.
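One way to apply the 500 kb criterion is sketched below in Python. The interpretation is an assumption: differentially expressed genes on the same chromosome are grouped greedily so that each cluster spans less than 500 kb from the start of its first gene to the end of its last gene. The function name, the tuple layout, and the greedy strategy are hypothetical and are not described in the text.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed input layout: (gene_name, chromosome, start_bp, end_bp)
Gene = Tuple[str, str, int, int]


def cluster_by_span(genes: List[Gene], max_span: int = 500_000) -> List[List[str]]:
    """Greedily group genes per chromosome into clusters spanning < max_span bp."""
    by_chrom: Dict[str, List[Tuple[int, int, str]]] = defaultdict(list)
    for name, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, name))

    clusters: List[List[str]] = []
    for chrom in sorted(by_chrom):
        current: List[str] = []
        cluster_start = 0
        for start, end, name in sorted(by_chrom[chrom]):
            # Close the current cluster if adding this gene would reach max_span.
            if current and end - cluster_start >= max_span:
                if len(current) > 1:
                    clusters.append(current)
                current = []
            if not current:
                cluster_start = start
            current.append(name)
        # Single genes are not reported as clusters.
        if len(current) > 1:
            clusters.append(current)
    return clusters
```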

Phylogenetic and evolutionary analysis
Using the signal peptide sequences of the two nematode GRSPs, a phylogenetic tree was built with the Molecular Evolutionary Genetics Analysis package version 6 (MEGA 6) [10] to examine how the nematode GRSPs families had evolved by gene duplication. The tree was inferred by the neighbor-joining (NJ) method under p-distance [11], and the bootstrap consensus tree inferred from 500 replicates was taken to represent the evolutionary history and to assess the reliability of the tree. All sites bearing alignment gaps and missing information were retained initially and excluded as necessary using the pairwise deletion option.
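As an illustration of the distance measure, the following Python sketch computes a p-distance between two aligned sequences with pairwise deletion, that is, sites with a gap or missing symbol in either sequence are skipped for that pair only. It is a simplified stand-in written for this text, not MEGA's implementation, and the choice of "-" and "?" as gap and missing symbols is an assumption.

```python
def p_distance(seq_a: str, seq_b: str, skip_chars: str = "-?") -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence carries a gap or missing symbol are
    ignored for this pair only (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue
        compared += 1
        if a != b:
            differences += 1

    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differences / compared
```

A matrix of such pairwise distances is the input that a neighbor-joining implementation would then use to build the tree.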

Analysis of the nucleotide sequences
Using MEGA 6, we estimated the transition (Ti)/transversion (Tv) ratio (R) among nucleotides, the numbers of synonymous (dS) and nonsynonymous (dN) substitutions per site, and the codon-based Z-test for purifying selection. The difference dN-dS was calculated under the modified Nei-Gojobori method (assumed Ti/Tv bias = 2,2), and standard errors (SE) were estimated by the bootstrap method (800 replicates; seed = 17,114) (for details, please refer to the supplementary materials and methods in [12]).
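To make the transition/transversion distinction concrete, the Python sketch below counts observed transitions (purine to purine or pyrimidine to pyrimidine) and transversions between two aligned nucleotide sequences. This raw count is only a crude proxy and is not the model-based estimate of R reported by MEGA; the function name is hypothetical.

```python
from typing import Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def count_ti_tv(seq_a: str, seq_b: str) -> Tuple[int, int]:
    """Count observed transitions and transversions between aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    transitions = 0
    transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip identical sites and anything that is not A, C, G, or T.
        if a == b or a not in BASES or b not in BASES:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions
```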

Genome-wide analysis of GRSPs across 10 species
The number of GRSPs identified in each of the 10 genomes was 4 for human, 6 for zebrafish, 53 for fruit fly, 110 for C. elegans, 93 for C. briggsae, 52 for A. thaliana, 0 for M. brevicollis, 0 for S. cerevisiae, 5 for D. discoideum, and 0 for G. theta. The two nematodes (110 for C. elegans and 93 for C. briggsae) are extremely enriched for GRSPs. Analysis of C. elegans GRSPs against the other nine species revealed that the numbers of two-way reciprocal "best hit" orthologs were, respectively, 0, 2, 8, 90, 3, 0, 0, 2, and 0 (Table 1) [12]. The small number of matching orthologs of C. elegans GRSPs in the other species may indicate that GRSPs were rarely vertically inherited. Besides the two nematodes, D. melanogaster and A. thaliana are also enriched for GRSPs compared with the other species analyzed here, which may indicate that an evolutionary expansion of GRSPs occurred in nematodes, arthropods, and plants during evolutionary adaptation and speciation.

Identification and classification of the two nematode GRSPs
Based on sequence similarity and the conservation of intron position and phase, the 203 GRSPs of the two nematodes were classified into 17 subfamilies (for details, please refer to Figures S1 and S2 in [12]). GRSPs mature peptides are enriched for glycine, with content ranging from 17 to 74% (for details, please refer to Table S3 in [12]). The 62 GRSPs (30.54%) with glycine content from 30 to 40% are the most abundant (Figure 1). Among the 110 C. elegans GRSPs, 36, 11, 14, and 2 have already been designated as "fungus-induced protein related" (FIPR) or "fungus-induced protein" (FIP), "Caenorhabditis bacteriocin" (CNC), "neuropeptide-like protein" (NLP), and "DAF-16 [...] ) on chromosome V, and 6 (6.45%) on chromosome X. Comparing S1B to S1C showed that the distribution ratios of GRSPs on the corresponding chromosomes of the two nematodes are similar.
Evolutionary Expansion of Nematode-Specific Glycine-Rich Secreted Peptides http://dx.doi.org/10.5772/intechopen.68621

the following shared characteristics: (1) a typical signal peptide located at the N-terminus, (2) a precursor peptide of fewer than 200 AA, and (3) a predicted mature peptide with high glycine content. By comparison with the three members (NP_001024238, NP_501117, and NP_504970) already named as GRSPs (grsp-1, grsp-3, and grsp-4 in the public database), we designated the other 47 unnamed peptides as GRSPs by these criteria. GRSPs identified in C. briggsae were given the prefix "Cbr," representing the first three letters of the species name C. briggsae, followed by the name of the corresponding ortholog in C. elegans, following the previous study [7]. Except for Cbr-grsp-32, all C. briggsae GRSPs have corresponding orthologs in C. elegans. The numbers of FIPR or FIP, CNC, NLP, and GRSPs family members in C. briggsae are, respectively, 31, 9, 12, and 41 (for details, please refer to Table S1 in [12]).
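The shared characteristics can be expressed as a simple filter, sketched here in Python. Several points are assumptions made for illustration: the signal peptide length is taken as given (for example, from an external predictor, which the text does not name), and the 17% glycine threshold is borrowed from the lower end of the glycine content range reported above, because the text does not state an explicit cutoff. The function names are hypothetical.

```python
def glycine_content(mature_peptide: str) -> float:
    """Percentage of glycine (G) residues in a mature peptide."""
    if not mature_peptide:
        raise ValueError("mature peptide is empty")
    return 100 * mature_peptide.upper().count("G") / len(mature_peptide)


def looks_like_grsp(precursor: str,
                    signal_len: int,
                    min_gly_pct: float = 17.0,  # assumed cutoff, not from the source
                    max_len: int = 200) -> bool:
    """Apply the three shared criteria to a precursor peptide sequence."""
    # (1) an N-terminal signal peptide must be present and leave a mature peptide
    if signal_len <= 0 or signal_len >= len(precursor):
        return False
    # (2) the precursor must be shorter than max_len amino acids
    if len(precursor) >= max_len:
        return False
    # (3) the mature peptide must be glycine-rich
    return glycine_content(precursor[signal_len:]) >= min_gly_pct
```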

The evidence of transcriptional expression of C. elegans GRSPs
Highly homologous GRSPs are usually clustered together on the two nematode genomes. This is exemplified by the GRSPs from fipr-3 to fipr-9, which are clustered on C. elegans chromosome V; the percent identity of their protein-coding sequences ranges from 86.1 to 100% (for details, please refer to Figure S4 in [12]). It is well known that short genes enriched for repeat sequences are frequently annotated incorrectly in genome annotation. To avoid false positives resulting from genome annotation, we further verified the transcriptional expression of all C. elegans GRSPs using available public databases. Evidence in the GEO database showed that 65 C. elegans GRSPs were transcriptionally expressed (for details, please refer to Table S1 in [12]). For the other 45 GRSPs without transcriptional evidence in the GEO database, RNA reads from the C. elegans transcriptome project were used to confirm their transcription, which showed that all of these GRSPs except fipr-12 had 100% matching reads in this project (for details, please refer to Figure S5 in [12]).

The clustered distribution of GRSPs on the two nematode genomes
The distribution of GRSPs on the two genomes was marked by the following features (Figure 2 and Table 2). [...] Third, GRSPs clusters were maintained in relatively conserved synteny blocks on the chromosomes of the two nematodes (Figure 2 and Table 2). With the exception of four GRSPs clusters without matching synteny clusters in the C. briggsae genome, all other GRSPs clusters have matching synteny clusters between the two nematodes. The lack of matching synteny clusters for these four clusters in C. briggsae could be attributed to the following reasons: (1) no orthologs of the C. elegans GRSPs were available in C. briggsae, (2) the orthologs of the C. elegans GRSPs in C. briggsae were integrated into another, unequal GRSPs cluster of C. briggsae, or (3) the map position of the orthologs of the C. elegans GRSPs on the C. briggsae genome had changed. Some orthologous synteny clusters showed a one-to-two match between the genomes. For example, the GRSPs cluster from Cbr-grsp-27 to Cbr-grsp-23 on C. briggsae chromosome V matched two orthologous synteny clusters (from grsp-23 to grsp-16 and from grsp-40 to grsp-4) on C. elegans chromosome V.
In addition, the order of the orthologous synteny blocks of GRSPs clusters on chromosome V was more conserved than that on the other chromosomes of the two nematodes. Orthologous pairs of GRSPs between the two nematodes were linked by straight lines on their genome maps, which showed that the connecting lines of the orthologous GRSPs clusters on chromosome V were more likely to be crossovers than those on other chromosomes (Figure 2). Here, a crossover means that the order of orthologous synteny blocks of GRSPs clusters was maintained on the genomes of the two nematodes.

Table 2. Summary of GRSPs clusters on the chromosomes of the two nematodes.

Nematology - Concepts, Diagnosis and Control

The transcriptional co-expression of C. elegans GRSPs clusters after infection
Genome-wide transcriptional analysis showed that many C. elegans genes that respond to infection are located in small genomic clusters [8]. All members of the GRSPs cluster from nlp-27 to nlp-34 were induced by D. coniospora infection of C. elegans [7]. Using microarray-based transcriptome data sets of C. elegans infection [7,8,13-16], we analyzed the changes in transcriptional expression of C. elegans GRSPs after infection. The results showed that a total of 108 C. elegans GRSPs were differentially expressed at the transcriptional level after infection in previous studies; these are indicated by blue letters in Figure 3. Co-expressed clusters of C. elegans GRSPs are shaded in grey (for details, please refer to Table S4 in [12]). It remains possible that the two C. elegans GRSPs (grsp-24 and grsp-39) without detectable expression in the studies analyzed here may be detectable in other studies, which we were unable to mine due to the limited length of this study [7].

The evolution of GRSPs multigene families by gene duplications
GRSPs subfamilies were classified based on precursor sequence similarity and conservation of gene structure, whereas phylogenetic analysis was performed using the signal peptide sequences. The similarity of these two sets of sequences may not be perfectly consistent among the GRSPs, which would explain why certain members of the same subfamily were located in different clades of the phylogenetic tree (Figure 3). The orthologous GRSPs of the two nematodes detected above were well resolved by the phylogenetic analysis. Certain members of subfamilies (such as the members of subfamily I) were clustered together on their chromosomes and also fell in the same clade of the phylogenetic tree (Figure 3). Five GRSPs from nlp-27 to nlp-31 were clustered on the C. elegans genome. Phylogenetic analysis showed that the nlp-27 clade was distinct from the clade formed by nlp-28 to nlp-31, which is similar to previous results [7].

Notes: dN, nonsynonymous substitutions; dS, synonymous substitutions; SE, standard error; Ti, transition; Tv, transversion; R, overall transition/transversion bias. The overall average difference of (dN-dS) was less than zero, and the standard error value was less than 0.05.

Purifying selection of the two nematode GRSPs
Under the codon-based Z-test, purifying selection was tested for individual sequence pairs and for the overall average. The resulting probability values were equal to zero, which rejected the null hypothesis of strict neutrality (dN = dS) in favor of the alternative hypothesis (dN < dS). The overall average difference dN-dS was less than zero, and the standard error values were less than 0.05. Synonymous substitutions clearly prevailed in the protein-coding sequences of the nematode GRSPs, which indicates purifying selection. With an average ratio R (Ti/Tv) > 1, the patterns of nucleotide substitution also showed a predominance of transitions over transversions (Table 4).
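The logic of a one-sided Z-test for purifying selection can be sketched in Python as follows. The sketch assumes that the statistic takes the form Z = (dS - dN) / sqrt(Var(dS) + Var(dN)), that the variances come from bootstrap resampling, and that the P value is obtained from a standard normal approximation. It is an illustration under these assumptions, not MEGA's implementation, and the function name is hypothetical.

```python
import math
from typing import Tuple


def z_test_purifying(d_n: float, d_s: float,
                     var_dn: float, var_ds: float) -> Tuple[float, float]:
    """One-sided Z-test of H0: dN = dS against HA: dN < dS.

    Returns the Z statistic and the upper-tail P value under a
    standard normal approximation.
    """
    total_var = var_dn + var_ds
    if total_var <= 0:
        raise ValueError("combined variance must be positive")
    z = (d_s - d_n) / math.sqrt(total_var)
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

A large positive Z (dS well above dN relative to the combined standard error) gives a small P value and supports purifying selection; equal dN and dS give Z = 0 and P = 0.5.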

Discussion
Soil organisms (A. thaliana) and/or bacterial feeders (the two nematodes, D. discoideum, and the fruit fly, which feeds on rotting fruit containing large numbers of bacteria) are relatively enriched for GRSPs in the current study. The environmental and survival stress of soil living and/or bacterial feeding may be one of the main evolutionary driving forces for the expansion of lineage-specific GRSPs in the two nematodes. A comparable case is the expansion of nematode-specific chemosensory genes (about 2000 in C. elegans versus about 1000 in human, roughly a twofold difference), which allows the nematode to mount a rapid response to environmental stimuli [17]. Compared with the amplification of nematode-specific chemosensory genes, the amplification of nematode-specific GRSPs is even more striking (110 in C. elegans versus 4 in human, about 28 times).
The conservation of precursor organization and the unaltered position and phase of introns, together with the homologous DNA sequences, suggest that the GRSPs clusters in the two nematodes might have arisen from local DNA duplications. Local gene duplication gives rise to clusters of paralogous genes whose products have similar functions. Paralogous genes with similar functions and expression patterns are frequent in C. elegans [18]. The co-expression of gene clusters encoding different proteins with similar functions in specific regions should provide an effective combinatorial means of coordinating complex biological systems [19]. Most co-expressed GRSPs clusters span less than 10 kb on their chromosomes, and the smallest spans 1.05 kb (co-expression of grsp-40 and grsp-38) (Table 3). Different GRSPs within the same cluster responded differently to the same infection. For example, the GRSPs from cnc-1 to cnc-5 (7.17 kb) and cnc-11 in the same cluster were co-expressed, with upregulation of cnc-11, cnc-1, and cnc-2 and downregulation of cnc-3 to cnc-5 after C. elegans infection [14]. The GRSPs cluster from grsp-35 to grsp-36 (5.13 kb) was upregulated by M. nematophilum and P. aeruginosa infection of C. elegans [8,16] and downregulated by S. enterica and S. aureus infection [13,14]. A noticeable overlap among the C. elegans GRSPs induced by different infections may indicate that the different sets of induced GRSPs still share some functions. Considering the large extent of operon regulation in C. elegans, we searched all C. elegans genes contained within operons using an in-house Perl script to detect whether the small clusters of adjacent GRSPs could be co-regulated as operons. Although no C. elegans GRSPs were identified in operon regions (data not shown), the short genetic and physical distances on the chromosomes and the highly homologous sequences suggest that neighboring GRSPs arising from duplication may share the same regulatory sequences. Such shared regulatory elements in their promoters could be directly and coordinately activated by the binding of transcription factors.
With similar variance of (dN-dS), the GRSPs of the two nematodes might have experienced similar selective pressure during evolution, which is in concordance with the neutral mutation-random drift theory of molecular evolution. The relatively conserved synteny blocks of the orthologous GRSPs clusters suggest that these GRSPs were subject to functional constraint. With increasing species complexity, genome size and the number of members of a gene family usually undergo an evolutionary expansion for the similar essential cellular mechanisms shared by eukaryotes [20]. The basic physiological processes of C. elegans are similar to those observed in higher organisms. The small number of matching orthologs of C. elegans GRSPs in the other species may indirectly reflect nematode-specific biological functions of C. elegans GRSPs that are essential in nematode-specific environments, such as soil living and bacterial feeding. The evolutionary diversification of these GRSPs might enhance the ability of C. elegans innate immunity to adapt to environmental stress [7].
This study built a full set of GRSPs across 10 species, from the alga G. theta to human, by genome-wide comparison. The two nematodes were enriched for GRSPs, which provide a good example of local DNA duplications and which maintained relatively conserved synteny blocks on the two genomes after speciation. The phylogenetic conservation of syntenic GRSPs clusters on the two genomes, the co-expressed GRSPs clusters, and strong purifying selection may indicate evolutionary constraints that allow C. elegans to mount a rapid systematic response to infection by co-expression of GRSPs clusters. The mechanism of co-expression, co-regulation, and co-functionality behind these GRSPs clusters is still unknown. Our knowledge is expected to improve with further comparative genomics of correlated expression patterns across different nematodes (such as C. brenneri and C. remanei), which holds promise to provide insights into the adaptive advantage of co-expressed GRSPs in nematodes.