Glutamine synthetase evolutionary history revisited: Tracing back beyond the Last Universal Common Ancestor
Abstract
Glutamine synthetase (GS; EC 6.3.1.2, L-glutamate: ammonia ligase) is an essential enzyme in nitrogen assimilation. It catalyzes glutamine synthesis from glutamate and ammonium, coupled to ATP hydrolysis. Four forms of GS have been described in the literature. These enzyme types are distinguished by their primary and quaternary structures. GS-encoding genes are believed to be among the oldest functioning genes studied, and their evolutionary history was explored in classic studies in the 1990s. Here, we evaluated GS-homologous sequences from the three domains of life to revisit their origins and evolutionary history. There are clear examples of ancient duplications and interdomain horizontal gene transfers. We present GS-encoding genes as one multigenic family that comprises three distinct groups. Our findings are presented in light of two main hypotheses for GS origin and evolution, and we argue in favor of gene duplications giving rise to the three genes in the Last Universal Common Ancestor. The Type I family is the most diverse, including a subgroup of polyamine-metabolizing enzymes as well as many examples of noncatalytic GS homologs. Many instances of gene loss, duplication, and transfer have occurred after the diversification of life, contributing to the complex evolutionary history of GS.
Glutamine synthetase (GS; EC 6.3.1.2; L-glutamate: ammonia ligase) is a central enzyme in nitrogen metabolism. It participates in the GS-GOGAT cycle, a fundamental ammonium assimilatory pathway, catalyzing the ATP-dependent production of glutamine from glutamate and ammonium (Meek and Villafranca 1980). Because of this essential role, it is considered a vital enzyme present in all living organisms. In fact, GS-encoding genes are believed to be among the oldest functioning genes studied (Kumada et al. 1993).
There are four forms of GS described up to now. They are all homo-oligomeric enzymes, characterized by their primary and quaternary structures and encoded by distinct genes. Type I GSs (GSI) are dodecameric enzymes built from ∼55 kDa protomers arranged as two rings (Almassy et al. 1986). Each ring is composed of six subunits encoded by glnA. Type II (GSII) are encoded by glnII. They were first reported as octamers with subunits of ∼36 kDa (Llorca et al. 2006), and later described as decamers composed of two pentameric rings (Unno et al. 2006; He et al. 2009). Type III (GSIII) were first described in anaerobic bacteria and are encoded by glnN (Southern et al. 1986; Hill et al. 1989). Initially believed to be hexamers (Hill et al. 1989), they are now recognized as dodecamers as well, with a subunit molecular mass of ∼75 kDa (Van Rooyen et al. 2011).
There is a fourth and debatable form of GS described in Rhizobium: the GlnT, encoded by glnT (Espin et al. 1990). These are described as octameric enzymes with a subunit molecular mass of ∼47 kDa (Shatters et al. 1993). They have also been called GSIII (Chiurazzi et al. 1992; Liu and Kahn 1995; Merrick and Edwards 1995), regardless of the glnN-encoded GSIII. However, despite being described as considerably different from GSI and GSII (Liu and Kahn 1995), there are reports of their close relationship to Type I GSs (Pesole et al. 1995). This adds substantial confusion to the nomenclature, and the relationship of GlnT with the other GSs needs to be clarified. An additional nomenclature issue concerns the gene family/superfamily. In most of the literature, the different GS types are introduced simply as Types. However, some authors refer to them as distinct gene families (Ghoshroy et al. 2010), whereas others refer to one single gene family (Van Rooyen et al. 2011).
Initially, GSI were found only in Bacteria and Archaea, whereas GSII were found only in Eukaryotes (Carlson and Chelm 1986). When Type II sequences were identified in some soil bacteria, this was first interpreted as a lateral transfer from plants to symbiotic bacteria (Carlson and Chelm 1986). Afterward, phylogenetic studies indicated that these variants result from an ancient duplication preceding the divergence between life domains (Kumada et al. 1993). This study predicted Type I GSs to be present in Eukaryotes as well, and such examples have been found (Mathis et al. 2000; Wyatt et al. 2006).
The classic research that established our basic knowledge of GS evolution dates to the 1990s (Pesole et al. 1991, 1995; Kumada et al. 1993; Brown et al. 1994). These studies were developed before the accumulation of genomic data and were not revisited subsequently. In addition, Type III GSs have been excluded from robust phylogenetic investigations in a broader context. Besides being the last well-recognized type to be described, these proteins are known to display highly divergent sequences compared with GSI and GSII, which challenges the investigation of their evolutionary relationships with the others (Brown et al. 1994). Therefore, a modern comprehensive approach to investigate GS evolution is missing.
Here, we performed phylogenetic analysis using GS homolog sequences from the three domains of life to reconstruct the evolutionary history of this gene family. We present one multigenic family that encompasses three distinct groups, instead of four. The fourth type (GlnT) is actually a subgroup within Type I sequences. We highlight GS homologs with functions other than the traditional glutamine biosynthesis and advocate for new genes and proteins to be investigated in the proper phylogenetic context. There are two main hypotheses on GS origins and evolution. The first postulates that the distinct GS types are encoded by orthologous genes that diversified along with the diversification of life. The second postulates that they are encoded by ancient paralogous genes that further diversified along with the diversification of life. We maintain that the three GS-encoding genes were already present in the Last Universal Common Ancestor (LUCA) and originated from gene duplications during the origin-of-life period.
Material and Methods
SEQUENCE SEARCH AND IDENTIFICATION
Four well-recognized protein sequences, representing each of the GS types, were used as queries to search for GS homologs encoded by chosen genomes using BLASTp with an E-value threshold of 10⁻³. These sequences and their UniProtKB (http://www.uniprot.org/) accession numbers are as follows: Salmonella typhimurium ATCC 700720 GSI (P0A1P6), Rhizobium leguminosarum bv. phaseoli GSII (Q02154), Bacteroides fragilis ATCC 25285 GSIII (Q5LGP1), and Rhizobium meliloti strain 1021 GlnT (O87393). Bacterial, archaeal, and fungi sequences were retrieved from selected finished genomes deposited in JGI Integrated Microbial Genomes (https://img.jgi.doe.gov/). CyanoBase (http://genome.microbedb.jp/cyanobase/) was used to search complete cyanobacterial genomes, and GenoList (http://genolist.pasteur.fr/) was used to search Mycobacterium tuberculosis H37Rv genome. Plant, animal, and additional fungi sequences were retrieved from Phytozome version 12.1.6 (https://phytozome.jgi.doe.gov/), Metazome version 3.2 (https://metazome.jgi.doe.gov/), and MycoCosm (https://genome.jgi.doe.gov/programs/fungi/index.jsf), respectively.
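The E-value cutoff described above can be illustrated with a minimal Python sketch. It assumes the BLASTp hits were exported in the standard BLAST+ tabular layout (`-outfmt 6`), in which the subject identifier is the second column and the E-value the eleventh; the paper does not state which output format was used, and the function name `filter_hits` is hypothetical.

```python
# Sketch only: assumes BLAST+ tabular output (-outfmt 6), where column 2 is
# the subject ID and column 11 is the E-value. Not the authors' pipeline.
E_VALUE_THRESHOLD = 1e-3  # threshold stated in the text


def filter_hits(lines, threshold=E_VALUE_THRESHOLD):
    """Return the sorted, de-duplicated subject IDs whose E-value passes the cutoff."""
    hits = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= threshold:
            hits.add(fields[1])
    return sorted(hits)
```

Collecting subject identifiers in a set means that a protein retrieved by more than one of the four query sequences is counted only once.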
We used an acronym to identify each sequence according to its genome of origin. The acronyms are composed of three letters: the first is the first letter of the genus name, and the other two are the first and second letters of the species epithet (e.g., Salmonella typhimurium is designated “Sty”, so all sequences found in this genome are identified with “Sty” followed by an identification number). The exceptions are a few acronyms with four letters, used to resolve ambiguities, and Synechococcus sp. WH 7803 (“Syn”), designated by the first three letters of the genus name.
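The naming rule above can be written as a short Python sketch. The function name `make_acronym` and the `overrides` argument are hypothetical, introduced here only to illustrate the default rule and its listed exceptions.

```python
def make_acronym(genus, species, overrides=None):
    """Build a genome acronym: first letter of the genus plus the first
    two letters of the species epithet (e.g. "Sty").

    `overrides` is a hypothetical lookup for the exceptions mentioned in
    the text (four-letter acronyms and "Syn" for Synechococcus sp.).
    """
    overrides = overrides or {}
    key = f"{genus} {species}"
    if key in overrides:
        return overrides[key]
    return genus[0].upper() + species[:2].lower()


# Exceptions taken from the text and from Table 1.
EXCEPTIONS = {
    "Synechococcus sp.": "Syn",
    "Methanococcus maripaludis": "Mmar",
    "Methanosarcina mazei": "Mmaz",
}

print(make_acronym("Salmonella", "typhimurium"))            # Sty
print(make_acronym("Synechococcus", "sp.", EXCEPTIONS))     # Syn
```

Without the exception table, Methanococcus maripaludis and Methanosarcina mazei would both map to "Mma", which is the kind of ambiguity the four-letter acronyms resolve.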
The sequences identified were evaluated for the presence of conserved domains using PFAM database (https://pfam.xfam.org/). Protein sequences with no identified GS catalytic domain (Gln-synt_C, PFAM PF00120) were excluded. Other sequences with annotation ambiguities or poorly aligned in a preliminary analysis were also excluded. The remaining protein sequences comprise our first dataset with 331 sequences. Table 1 (Bacteria and Archaea) and Table 2 (Eukarya) show the list of the genomes searched, the acronyms used to identify the sequences found in each of them, and the number of sequences found in each genome.
Phylum | Number of sequences/Type | ||||||||
---|---|---|---|---|---|---|---|---|---|
Archaea | Species | Strain | Acronymon | Genome ID | I | II | III | Total | Source database |
Thermoproteota | Cenarchaeum symbiosum | A | Csy | 641522613 | 2 | 2 | |||
Desulfurococcus fermentans | Z-1312, DSM 16532 | Defe | 2513020047 | 1 | 1 | ||||
Thaumarchaeota archaeon | MY3 | Tha | 2657244923 | 3 | 3 | ||||
Methanobacteriota | Methanococcus maripaludis | S2 / LL | Mmar | 2563366543 | 1 | 1 | |||
Pyrococcus furiosus | DSM 3638 | Pfu | 638154515 | 1 | 1 | ||||
Thermococcus gammatolerans | EJ3 | Tga | 644736411 | 1 | 1 | ||||
Thermoplasmatota | Methanogenic archaeon | ISO4-H5 | Mar | 2660238307 | 1 | 1 | |||
Thermoplasmatales archaeon | BRNA1 | The | 2565956561 | 1 | 1 | ||||
Halobacteriota | Haloferax volcanii | DS2, ATCC 29605 | Hvo | 646564536 | 1 | 1 | |||
Methanosarcina mazei | LYC | Mmaz | 2630968898 | 2 | 2 |
Bacteria | Species | Strain | Acronymon | Genome ID | I | II | III | Total | Source Database |
---|---|---|---|---|---|---|---|---|---|
Cyanobacteria | Anabaena variabilis | ATCC 29413 | Ava | 1 | 1 | Cyanobase | |||
Gloeobacter violaceus | PCC 7421 | Gvi | 637000121 | 2 | 2 | ||||
Nostoc punctiforme | PCC 73102 | Npu | 2 | 2 | Cyanobase | ||||
Synechococcus sp. | WH 7803 | Syn | 2 | 1 | 3 | Cyanobase | |||
Actinobacteriota | Acidimicrobium ferrooxidans | ICP, DSM 10331 | Afe | 644736322 | 4 | 4 | |||
Acidobacterium capsulatum | ATCC 51196 | Aca | 643692001 | 2 | 2 | ||||
Acidothermus cellulolyticus | 11B | Ace | 639633001 | 4 | 4 | ||||
Actinomyces radicidentis | CCUG 36733 | Ara | 2687453525 | 2 | 2 | ||||
Brevibacterium linens | BS258 | Bli | 2675903680 | 3 | 3 | ||||
Corynebacterium diphtheriae | BH8 | Cdi | 2511231099 | 2 | 2 | ||||
Corynebacterium glutamicum | ATCC 13032 | Cgl | 639279306 | 2 | 2 | ||||
Cryobacterium arcticum | PAMC 27867 | Car | 2751185728 | 2 | 1 | 3 | |||
Frankia alni | ACN14a | Fal | 637000115 | 5 | 1 | 6 | |||
Geodermatophilus obscurus | G-20, DSM 43160 | Gob | 646311931 | 2 | 2 | ||||
Gordonia bronchialis | 3410, DSM 43247 | Gbr | 646311932 | 3 | 1 | 4 | |||
Ilumatobacter coccineum | YM16-304 | Ico | 2545824624 | 1 | 1 | 2 | |||
Kineococcus radiotolerans | SRS30216 | Kra | 640753031 | 3 | 1 | 4 | |||
Mycobacterium tuberculosis | H37Rv | Mtu | 4 | 4 | GenoList | ||||
Streptomyces albus | DSM 41398 | Sal | 2639762818 | 4 | 1 | 5 | |||
Streptomyces avermitilis | MA-4680 | Sav | 637000304 | 3 | 1 | 4 | |||
Firmicutes | Bacillus subtilis subtilis | 168 | Bsu | 646311909 | 1 | 1 | |||
Clostridium beijerinckii | 59B | Cbe | 2608642293 | 1 | 1 | 2 | |||
Lactobacillus acidophilus | NCFM | Lac | 637000138 | 1 | 1 | ||||
Mycoplasma pneumoniae | 309 | 2511231209 | 0 | ||||||
Spirochaetota | Spirochaeta thermophila | DSM 6192 | Sth | 648028053 | 1 | 1 | |||
Verrucomicrobiota | Chlamydia trachomatis | IU888 | Ctr | 2554235384 | 0 | ||||
Verrucomicrobia bacterium | HZ-65 | Vba | 2775506790 | 1 | 1 | ||||
Planctomycetota | Planctomycetes bacterium | NH11 | Pba | 2751185673 | 1 | 1 | 2 | ||
Bacteroidota | Bacteroides fragilis | 638R | Bfr | 650377910 | 1 | 1 | 2 | ||
Chlorobium phaeobacteroides | DSM 266 | Cph | 639633020 | 1 | 1 | ||||
Draconibacterium orientale | FH5 | Dor | 2576861037 | 1 | 1 | 1 | 3 | ||
Dyadobacter fermentans | DSM 18053 | Dyfe | 644736356 | 1 | 1 | 1 | 3 | ||
Flavobacterium communis | PK15 | Fco | 2765235939 | 1 | 1 | 2 | |||
Flavobacterium psychrophilum | VQ50 | Fps | 2627853801 | 1 | 1 | ||||
Runella slithyformis | DSM 19594 | Rli | 2505679030 | 1 | 1 | 1 | 3 | ||
Aquificota | Aquifex aeolicus | VF5 | Aae | 637000010 | 1 | 1 | |||
Myxococcota | Cystobacter fuscus | DSM 52655 | Cfu | 2767802757 | 1 | 1 | 2 | ||
Myxococcus stipitatus | DSM 14675 | Mst | 2521172697 | 2 | 2 | ||||
Desulfobacterota | Geobacter sulfurreducens | KN400 | Gsu | 648231707 | 1 | 1 | |||
Thermodesulfobacterium commune | DSM 2178 | Tco | 2588253728 | 2 | 2 | ||||
Proteobacteria | Chromobacterium violaceum | ATCC 12472 | Cvi | 637000074 | 4 | 4 | |||
Escherichia coli | K-12, MG1655 | Eco | 646311926 | 2 | 2 | ||||
Nitrosomonas communis | Nm2 | Nco | 2627854142 | 1 | 1 | ||||
Pseudomonas aeruginosa | ATCC 27853 | Pae | 2687453218 | 8 | 8 | ||||
Rhizobium tropici | CIAT899 | Rtr | 2524023199 | 5 | 1 | 6 | |||
Rhodospirillum rubrum | F11 | Rru | 2511231162 | 4 | 4 | ||||
Salmonella typhimurium | LT2 | Sty | 637000255 | 1 | 1 | ||||
Xanthomonas campestris | 17 | Xca | 2639763137 | 3 | 3 |
- Phyla were designated according to GTDB taxonomy. Thermoproteota refers to the former Crenarchaeota. Methanobacteriota refers to the former Euryarchaeota. Sequences with no specified source database were retrieved from JGI/IMG.
Species | Number of sequences/type | |||||||
---|---|---|---|---|---|---|---|---|
Embryophyta | Acronymon | I | II | III | Total | Source database | ||
Phytozome | ||||||||
Amborella trichopoda | Atr | 1 | 4 | 5 | ||||
Ananas comosus | Acom | 1 | 6 | 7 | ||||
Aquilegia coerulea | Acoe | 1 | 4 | 5 | ||||
Arabidopsis thaliana | Ath | 1 | 6 | 7 | ||||
Brachypodium distachyon | Bdi | 1 | 4 | 5 | ||||
Brassica rapa FPsc | Bra | 2 | 9 | 11 | ||||
Citrus clementina | Ccl | 1 | 3 | 4 | ||||
Citrus sinensis | Csi | 1 | 3 | 4 | ||||
Eucalyptus grandis | Egr | 1 | 5 | 6 | ||||
Glycine max | Gma | 1 | 8 | 9 | ||||
Manihot esculenta | Mes | 2 | 4 | 6 | ||||
Medicago truncatula | Mtr | 6 | 5 | 11 | ||||
Oryza sativa | Osa | 1 | 4 | 5 | ||||
Physcomitrella patens | Ppa | 6 | 6 | 12 | ||||
Populus trichocarpa | Ptr | 1 | 8 | 9 | ||||
Ricinus communis | Rco | 1 | 2 | 3 | ||||
Selaginella moellendorffii | Smo | 2 | 2 | 4 | ||||
Setaria italica | Sit | 1 | 4 | 5 | ||||
Solanum lycopersicum | Sly | 3 | 5 | 8 | ||||
Solanum tuberosum | Stu | 2 | 5 | 7 | ||||
Zea mays PH207 | Zma | 5 | 5 |
Chlorophyta | Acronymon | I | II | III | Total | Source database | ||
---|---|---|---|---|---|---|---|---|
Phytozome | ||||||||
Chlamydomonas reinhardtii | Cre | 1 | 3 | 4 | ||||
Micromonas pusilla | Mpu | 1 | 2 | 3 | ||||
Ostreococcus lucimarinus | Olu | 1 | 1 | |||||
Volvox carteri | Vca | 3 | 3 |
Metazoa | Acronymon | I | II | III | Total | Source database | ||
---|---|---|---|---|---|---|---|---|
Metazome | ||||||||
Branchiostoma floridae | Bfl | 3 | 3 | |||||
Caenorhabditis elegans | Cel | 5 | 5 | |||||
Ciona intestinalis | Cin | 1 | 1 | |||||
Danio rerio | Dre | 1 | 5 | 6 | ||||
Drosophila melanogaster | Dme | 2 | 2 | |||||
Gallus gallus | Gga | 1 | 2 | 3 | ||||
Homo sapiens | Hsa | 1 | 1 | 2 | ||||
Nematostella vectensis | Nve | 1 | 1 | |||||
Tribolium castaneum | Tca | 2 | 2 | |||||
Xenopus tropicalis | Xtr | 1 | 2 | 3 |
Fungi | Acronymon | I | II | III | Total | Source database | Strain | Genome ID |
---|---|---|---|---|---|---|---|---|
Aspergillus niger | Ani | 2 | 1 | 3 | IMG | CBS 513.88 | 640281012 | |
Encephalitozoon intestinalis | 0 | IMG | ATCC 5050 | 2517287008 | ||||
Obelidium mucronatum | Omu | 2 | 2 | 4 | MycoCosm | JEL802 | ||
Rozella allomycis | Ral | 1 | 1 | MycoCosm | CSF55 | |||
Saccharomyces cerevisiae | Sce | 1 | 1 | IMG | S288C | 638208609 | ||
Ustilago bromivora | Ubr | 1 | 1 | 2 | IMG | UB2112 | 2739368079 |
Heterokonta | Acronymon | I | II | III | Total | Source database | Strain | Genome ID |
---|---|---|---|---|---|---|---|---|
Dictyostelium discoideum | Ddi | 2 | 1 | 3 | GenBank | AX4 | AAFI00000000 | |
Thalassiosira pseudonana | Tps | 1 | 1 | 1 | 3 | IMG | CCMP 1335 | 649328906 |
PHYLOGENETIC RECONSTRUCTION
Amino acid sequences were aligned with MUSCLE (Edgar 2004), using the default parameters implemented in MEGA7 (Kumar et al. 2016). Phylogenetic trees were constructed based on three datasets. The first comprises the 331 sequences from all gene families. When the original alignment was codified with GBlocks (http://molevol.cmima.csic.es/castresana/Gblocks_server.html) (Castresana 2000), the result was too restrictive, and the relationships were not well resolved. Hence, we decided to inspect and codify the original alignment manually. The resulting alignment, with 236 amino acid positions, was used to reconstruct the phylogeny of all sequences.
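The general idea of reducing an alignment to a subset of its columns can be sketched as follows. This is a deliberately simplified stand-in that drops columns by gap fraction alone; it is neither the GBlocks algorithm nor the manual procedure used by the authors, and the function name and the 0.5 default are assumptions chosen for illustration.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only alignment columns whose gap fraction is at most the cutoff.

    `alignment` maps sequence names to aligned strings of equal length,
    with "-" marking gaps. Simplified illustration, not GBlocks.
    """
    if not alignment:
        return {}
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(seq[i] == "-" for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}
```

A stricter cutoff retains fewer positions, which mirrors the trade-off described above: an overly restrictive selection leaves too few columns to resolve relationships.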
The second and third datasets are sequence subsets selected based on the results from the first tree. These subsets were aligned de novo, and the alignments were codified using GBlocks. The second one comprises only Type I-related proteins, totaling 158 sequences with 168 amino acid positions. The third one includes a subset of Type I-related proteins denoted GSI-α and GSI-β, totaling 60 sequences with 374 amino acid positions.
The codified alignments were evaluated with ProtTest 3.4.2 (Darriba et al. 2011) to select the amino acid substitution model, which was LG + G for the first and second datasets and LG + I + G for the third. The Bayesian analyses were performed using BEAST version 1.10.4 (Drummond and Rambaut 2007) as implemented on the CIPRES Science Gateway (Miller et al. 2010), with the “Speciation: Birth-Death Process” selected as tree prior. The MCMC chain was run for 70,000,000 generations (first and second datasets) or 50,000,000 generations (third dataset). Tracer version 1.6 (http://tree.bio.ed.ac.uk/software/tracer/) was used to verify chain convergence and that the effective sample size reached the expected value (ESS > 200).
The consensus trees were built with TreeAnnotator version 1.10.4, with a burn-in of 10% of the trees. The trees obtained were visualized and annotated using the Interactive Tree of Life online tool (https://itol.embl.de/). A Bayesian posterior probability (PP) of at least 0.95 was used to define well-supported clades.
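The burn-in and support criteria above can be made concrete with a small Python sketch. It assumes each sampled tree has already been reduced to the set of clades it contains (each clade a frozenset of tip names); this representation and the function names are hypothetical, and the code illustrates the bookkeeping only, not TreeAnnotator's implementation.

```python
from collections import Counter


def clade_support(sampled_trees, burnin_fraction=0.10):
    """Posterior probability of each clade after discarding the burn-in.

    `sampled_trees` is a list in sampling order; each item is a collection
    of clades, each clade a frozenset of tip names.
    """
    n_burnin = int(len(sampled_trees) * burnin_fraction)
    kept = sampled_trees[n_burnin:]
    if not kept:
        raise ValueError("no trees left after burn-in")
    counts = Counter(clade for tree in kept for clade in set(tree))
    return {clade: count / len(kept) for clade, count in counts.items()}


def well_supported(support, threshold=0.95):
    """Clades whose posterior probability reaches the threshold (PP >= 0.95)."""
    return {clade for clade, pp in support.items() if pp >= threshold}
```

In this formulation a clade's PP is simply the fraction of post-burn-in trees that contain it, so a clade present only in the discarded early samples receives no support at all.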
Results and Discussion
IDENTIFICATION OF GS-LIKE SEQUENCES
Reference sequences representing each of the GS types were used as queries in searching for GS homologs, as detailed before (SEQUENCE SEARCH AND IDENTIFICATION). A total of 101 genomes were used to search for GS-homologous sequences: 10 Archaea, 48 Bacteria, six Fungi, 25 Plants, 10 Metazoa, one amoeba, and one diatom. Only three of these genomes harbored no detectable GS-like sequences: those of the pathogenic bacteria Chlamydia trachomatis and Mycoplasma pneumoniae, and that of the pathogenic fungus Encephalitozoon intestinalis.
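As a quick consistency check on the genome counts quoted above, the tally can be reproduced in a few lines of Python. The dictionary keys simply restate the categories from the text; the variable names are illustrative.

```python
# Genome counts per group, as stated in the text.
genomes_searched = {
    "Archaea": 10,
    "Bacteria": 48,
    "Fungi": 6,
    "Plants": 25,
    "Metazoa": 10,
    "amoeba": 1,
    "diatom": 1,
}

# Genomes reported to have no detectable GS-like sequence.
genomes_without_gs = [
    "Chlamydia trachomatis",
    "Mycoplasma pneumoniae",
    "Encephalitozoon intestinalis",
]

total_genomes = sum(genomes_searched.values())
genomes_with_gs = total_genomes - len(genomes_without_gs)
print(total_genomes, genomes_with_gs)  # 101 98
```

The sum matches the stated total of 101 genomes, leaving 98 genomes that contribute at least one GS-like sequence.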
We had a final dataset of 331 sequences, excluding poorly aligned sequences and those with no GS catalytic domain annotated (Gln-synt_C, PFAM PF00120). Although some genomes harbored just one GS-like sequence, others displayed a variable number of identified sequences. Notably, some genomes from bacteria and plants harbored, for example, eight (Pseudomonas aeruginosa) or 12 (Physcomitrella patens) sequences. These data are presented in Tables 1 and 2, which summarize information referring to the genomes searched, and GS-homologous sequences found in each of them. Detailed information on BLAST results for each sequence is presented in Table S1.
Regarding protein domain organization, we retained in our dataset only sequences harboring the GS catalytic domain (Gln-synt_C, PFAM PF00120). Figure 1 shows the domain structure of the protein sequences used as references. The Type I and Type II reference sequences present a similar structure, with an additional β-Grasp domain. The Type III reference sequence harbors two additional and exclusive domains, whereas the Type T sequence presents only the catalytic domain. In addition, proteins carrying two other domains, which together compose three domain arrangements, were identified in our dataset, with examples from P. aeruginosa, Arabidopsis thaliana, and Brassica rapa, also represented in Figure 1. These three proteins belong to the Type I family.

GS-HOMOLOGOUS SEQUENCES REPRESENT A MULTIGENIC GENE FAMILY WITH THREE DISTINCT GROUPS
The final dataset of 331 GS-homologous sequences was used for the phylogenetic reconstruction represented in Figure 2. Three major groups of GS-homologous proteins were observed. Looking for the reference sequences indicated by checkmarks, we observe that the Type I group comprises both Type I and Type T GSs. Although the Salmonella typhimurium sequence (Type I reference) is embedded in a large well-supported subgroup (totaling 67 tips), the Rhizobium meliloti sequence (Type T reference) is only clearly related to a few noncharacterized proteins. These results sustain that the previously described GlnT sequences belong to the Type I family.

It seems that the proteins described as GlnT are Type I GSs that have lost their β-Grasp domain. Many other sequences resemble the Type T in displaying no domain other than the catalytic one, but they do not form a large well-supported clade with the Type T reference. There is one group comprising sequences with the C-domain only, described further on as GSI-γ, but it was not possible to clarify its relationship with the GlnT. The R. meliloti GlnT reference sequence belongs to a clade with six other proteins, two from plant genomes (Selaginella moellendorffii and Physcomitrella patens) and four from bacterial genomes: another Proteobacteria (Rhizobium tropici), one Cyanobacteria (Gloeobacter violaceus), and two Actinobacteriota (Frankia alni and Gordonia bronchialis). In that sense, we identified additional sequences clearly related to the described GlnT, but they have not been characterized yet. The only GlnT proteins characterized besides the R. meliloti reference belong to Rhizobium leguminosarum (Espin et al. 1990; Chiurazzi et al. 1992) and Agrobacterium tumefaciens (Rossbach et al. 1988). Rhizobium meliloti GlnT was described from a glnA-glnII mutant strain, lacking both GSI and GSII enzymes (Shatters et al. 1993). It was demonstrated to be posttranslationally regulated (Liu and Kahn 1995). Rhizobium leguminosarum GlnT was also demonstrated to exert GS activity (Espin et al. 1990). However, the A. tumefaciens glnT locus was considered cryptic, despite being responsible for GS activity when expressed in Escherichia coli (Rossbach et al. 1988).
The closer relationship between GlnT and GSI was first demonstrated by Pesole et al. (1995) with a limited number of sequences. However, the term has caused confusion even after that (Merrick and Edwards 1995; Mathis et al. 2000). Based on our results and evidence from the literature, we suggest that the term GlnT should be abandoned, unless new evidence is brought to light to clarify any specificity. The proteins described as GlnT so far must be considered as Type I GSs.
The GSIII family harbors bacterial, archaeal, and eukaryotic proteins. Some organisms (such as the archaeon Methanogenic archaeon, the bacterium Chlorobium phaeobacteroides, the fungus Rozella allomycis, and the alga Ostreococcus lucimarinus) present this protein as the only GS encoded by their genomes, so it could represent a functional GS. Other genomes encode additional GS homologs besides the Type III (see Tables 1 and 2). For example, the bacterium Bacteroides fragilis genome encodes the Type III reference sequence and one additional Type I protein. The fungus Obelidium mucronatum genome harbors two sister sequences belonging to family III, and two additional Type I-related sequences. It is interesting to note that GSIII were detected in basal eukaryotic groups, such as the Chytridiomycota O. mucronatum and the Cryptomycota R. allomycis. Ghoshroy and Robertson (2012) suggested that GSIII genes may have already been present in early eukaryotic genomes. Also, horizontal gene transfer (HGT) events between Eukaryotes have been reported (Ghoshroy and Robertson 2015), contributing to the complexity of this history.
Regarding domain structure, Type III GSs are well conserved among themselves and clearly distinct from the others, presenting two exclusive signature domains. Type II GSs are largely homogeneous, with the βC domain architecture, which is shared with a subset of Type I homologs. In addition, Type I sequences display examples with many other domain arrangements.
We did not find archaeal sequences belonging to the Type II family, but this family harbors both bacterial and eukaryotic sequences (Fig. 2), as expected (Brown et al. 1994). There is one subgroup comprising only bacterial proteins, whereas the other one accommodates some bacterial and all the eukaryotic Type II sequences. Considering the 137 eukaryotic GSII sequences in our dataset, 131 of them belong to the well-supported Eukaryotic-only clade, whereas the other six are grouped with bacterial proteins, comprising the mixed clade. In the Eukaryotic-only clade, we identified one clade comprising Fungi sequences, another one comprising Metazoan sequences, and a third one harboring Chlorophyta, Embryophyta, and one diatom sequence.
We identified the GSII mixed clade, harboring five bacterial proteins grouped along with moss and Chlorophyta sequences (Fig. 2). Ghoshroy et al. (2010) identified a similar clade when investigating GSII evolution. The mixed clade is related to the bacterial-only clade in their results, being interpreted as a nonendosymbiotic gene transfer from Gamma-Proteobacteria to the Chloroplastida. In contrast, the mixed clade is related to the eukaryotic-only one in our results, suggesting otherwise.
Using a broader sampling, Ghoshroy and Robertson (2012) found some GSII encoded by bacterial genomes related to the eukaryotic-only clade, with an additional bacterial-only sister group, corroborating our results. Based on our observations, we suggest an HGT from a protoeukaryote to a protobacterium. Such an HGT event “before the animal/plant split” had already been proposed (Pesole et al. 1991). Continued investigation with extended diversity will help to clarify the groups involved in such an ancient interdomain HGT event. Nonetheless, the bacterial-only GSII group is well established, which supports an origin of GSII before the split of the three domains.
Plant GSs grouped in both the GSI and GSII families, with one large well-supported group each. There are three plant Type I sequences outside of the major plant group, belonging to Physcomitrella patens and Selaginella moellendorffii genomes, suggesting horizontal transfer events from bacteria. It seems likely that the Type II proteins have diversified more in plants. Most of the genomes evaluated display only one GSI in the main group, whereas all of them display numerous GSII (see Table 2). Even so, some plants show two or three Type I sequences in the main group, and the Medicago truncatula genome harbors six of them altogether. The Zea mays genome is the only one with no Type I sequence detected here. We found putative GSI sequences in maize, but they were excluded from our initial databank for being poorly aligned. In searching the NCBI database with the Type I reference, we found a protein annotated as FluG (NCBI Reference Sequence: XP_023157110.1), distinct from the ones we had found on Phytozome. Hence, we believe the maize genome encodes a GSI homolog, but it was not present in our databank, probably because of annotation issues.
For obvious reasons, Type II GSs have been better characterized in plants, whereas the functions of Type I homologs remain mostly unknown (Bernard and Habash 2009). The traditionally studied plant GSs (Type II) are categorized in two groups, according to subcellular localization: GS1 enzymes are located in the cytosol, whereas GS2 enzymes are in the plastids (Bernard and Habash 2009).
The number of GS-encoding genes varies among plant species, as shown here. In addition, it is generally recognized that isoenzyme abundance varies according to species and between organs and cell types in the same plant, responding to developmental, environmental, and metabolic signals. These differences allow assimilation and recycling of ammonium from different sources in the soil or within the plant (Bernard and Habash 2009). Analyses of the expression profiles of these genes and of the properties of their proteins are available for Arabidopsis (Ishiyama et al. 2004), wheat (Bernard et al. 2008), poplar (Castro-Rodríguez et al. 2015), and tomato (Liu et al. 2016), for example. Chloroplastidic and cytosolic GSs are believed to represent sister groups originated from gene duplication in plant history (Doyle 1991; Biesiadka and Legocki 1997). The results presented here agree with those from the literature, as we observe a clear plant clade within the eukaryotic GSII sequences, suggesting there was no transfer from endosymbionts to the nuclear genome.
Type I homologs in plants may have been overlooked because they do not display GS activity, as has been demonstrated for M. truncatula (Silva et al. 2015) and A. thaliana (Doskočilová et al. 2011). The six GSI genes had already been identified in M. truncatula (Mathis et al. 2000), and two of the genes were demonstrated to be expressed (Silva et al. 2015). The results also indicated that these proteins are involved in root and nodule formation, probably participating in nitrogen sensing and/or signaling. The one A. thaliana Type I homolog (represented in Fig. 1) was characterized by Doskočilová et al. (2011), who proposed a role in biotic stress signaling involving microbial elicitation, as well as in root morphogenesis. The authors refer to this protein as NodGS, for its nodulin-like sequence fusion, the amidohydrolase domain. They also highlight the structural similarity with FluG, a morphogenetic factor in fungi. We did detect such a domain arrangement (A-C, see Fig. 1) within plants and fungi. In addition, plants display GSI homologs with other domain arrangements, whose functions have not been investigated. Mathis et al. (2000) had already found GSI sequences in plants other than M. truncatula. Here, we show that the occurrence of GSI proteins is pervasive in plants and reinforce the paralogous origin of Type I and II encoding genes, as predicted by Kumada et al. (1993).
Animals are also represented in both the GSI and GSII families. Although all animal genomes evaluated harbored at least one (and some of them several) Type II GS, only four animal Type I sequences were retrieved (belonging to Danio rerio, Xenopus tropicalis, Gallus gallus, and Homo sapiens genomes). They form one consistent clade, but its relationship with the other groups is not resolved (see Fig. 3). The human Type I-related sequence was first described from its transcripts in an expression profile survey of the eye lens. It was designated Lengsin, for lens GS-like (LGS) (Wistow et al. 2002). It has already been classified as a GSI-related protein (Wyatt et al. 2006), and proved to display no catalytic activity (Grassi et al. 2006). It probably exerts a structural role in the eye lens, having been recruited to a completely new and specific function (Grassi et al. 2006). Wyatt et al. (2006) demonstrated similar proteins to be conserved in other vertebrate species. Also, similar sequences were identified in the sea urchin genome, indicating an ancient gene with an ancient function in metazoans (Wyatt et al. 2006).

As in plants, metazoan GSII is the best characterized and the biochemically functional GS gene family. GS is a well-known enzyme in neuroscience for its participation in the glutamate/GABA-glutamine cycle, because glutamine is used in the brain as an essential precursor for the synthesis of the neurotransmitters glutamate and γ-aminobutyric acid (GABA) (Bak et al. 2006; Walls et al. 2015). Another example is the role of GS in ammonium detoxification. Ip and Chew (2010) reviewed ammonium metabolism in fishes, comparing it to that in mammals.
TYPE I GS HOMOLOGS COMPRISE DISTINCT SUBGROUPS
Figure 3 represents the phylogenetic reconstruction considering only the Type I GS family to better examine these proteins and their relationships, because GSI-related sequences appear to display the most extensive heterogeneity. The three subgroups for which there is evolutionary and/or empirical evidence of GS function are highlighted. These subgroups will be denoted from now on as GSI-α, GSI-β, and GSI-γ. As discussed previously, biochemical investigations of proteins in the other subgroups, from animals, fungi, and plants, suggest that these subgroups lack GS activity. There is no function described in literature for the other sequences.
The subgroups GSI-α and GSI-β were originally proposed by Brown et al. (1994), and are detailed in Figure 4a. The biochemically functional enzymes characterized then were differentiated by a 25-amino acid insertion present in GSI-β sequences, and by the evidence of posttranslational modification via adenylylation only among GSI-β proteins (Brown et al. 1994). Pesole et al. (1995) established similar relationships within the GSI family and referred to these two subgroups as GSI (A–) and GSI (A+), corresponding to α and β, respectively. This classification is specified by the presence (+) or absence (–) of the adenylylation/deadenylylation mechanism.
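The correspondence between the two nomenclatures, together with the two distinguishing criteria described above, can be summarized in a minimal Python sketch. It is only an illustration: the names `GsiSubgroup` and `classify_gsi` and the boolean inputs are hypothetical, and the rule deliberately returns no call when the two criteria disagree. It is not the procedure used in this study, which assigns subgroups from the phylogenetic reconstruction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GsiSubgroup:
    """One GSI subgroup under both published naming schemes."""
    brown_1994: str   # name used by Brown et al. (1994)
    pesole_1995: str  # name used by Pesole et al. (1995)


# Correspondence stated in the text: (A–) is alpha, (A+) is beta.
ALPHA = GsiSubgroup(brown_1994="GSI-α", pesole_1995="GSI (A–)")
BETA = GsiSubgroup(brown_1994="GSI-β", pesole_1995="GSI (A+)")


def classify_gsi(has_25aa_insertion: bool,
                 adenylylation_evidence: bool) -> Optional[GsiSubgroup]:
    """Toy rule (an assumption, not the authors' method).

    Returns BETA when both the 25-amino acid insertion and evidence of
    adenylylation are present, ALPHA when both are absent, and None
    when the two criteria disagree and no call should be made.
    """
    if has_25aa_insertion and adenylylation_evidence:
        return BETA
    if not has_25aa_insertion and not adenylylation_evidence:
        return ALPHA
    return None
```

For example, `classify_gsi(True, True)` returns the GSI-β record, whose `pesole_1995` field gives the equivalent GSI (A+) label.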

The works of Brown et al. (1994) and Pesole et al. (1995) highlighted an HGT event from Archaea to Bacteria, represented by the Firmicutes GSI-α sequences, which is also reflected in our data. Pesole et al. (1995) also argued that both subgroups must have resulted from a duplication event: the two copies diverged and one or the other was subsequently lost, so that extant organisms retained only one subgroup.
We found two clades comprising Actinobacterial GSs that were not represented in these first studies, one on the GSI-α branch and one on the GSI-β branch (see Fig. 4a). Almost all the genomes from Actinobacteriota (Actinobacteria) evaluated here were represented on both. We also observed a new Archaeal clade within the GSI-β branch. Hence, we interpret the clear split at the node encompassing both subgroups as a gene duplication event (Fig. 3), providing additional evidence to support Pesole et al.’s (1995) hypothesis that the GSI-α and GSI-β branches represent paralogous sequences that originated before the divergence of the life domains. Several lineages lost one or the other or both, whereas the Actinobacteriota (Actinobacteria), in general, retained both.
It is interesting to observe that the protein from the Firmicutes species Bacillus subtilis is the canonical GSI-α and one of the most widely characterized GS enzymes (Wray et al. 2001; Fisher and Wray 2008; Murray et al. 2013). However, it is usually not discussed in the context of its evolutionary proximity to archaeal proteins. Bacillus subtilis has no other GS homolog, and this enzyme presents unique regulatory properties (Sonenshein 2008). It will be important to empirically characterize GSI-α from organisms in the bacterial clade, especially those that harbor both α and β sequences.
It is worth noting that the classifications of both Brown et al. (1994) and Pesole et al. (1995) encompass all the Type I sequences evaluated in those works. We recovered these groups in our analysis with additional examples. Nevertheless, we noticed that a large proportion of GSI-related sequences do not classify as GSI-α or GSI-β. Eukaryotic proteins were not evaluated in those studies, whereas here we identified several eukaryotic Type I proteins, as presented before. However, the more interesting outcome of our results is the establishment of a third Type I subgroup encompassing mostly bacterial proteins (colored in blue in Fig. 3), to which we will refer as GSI-γ (Gamma). The sequences belonging to this Gamma subgroup are detailed in Figure 4b. Most of the biochemically characterized proteins (highlighted in colors) do not display GS biosynthetic activity. In fact, they function as γ-glutamyl-polyamine synthetases. We will explore this function below.
We consider the well-characterized E. coli protein (Eco646313225, in blue in Fig. 4b) as the reference sequence for the GSI-γ group. The Escherichia coli genome encodes two GS homologs. One of them is the traditional GSI-β (closely related to the Salmonella typhimurium GSI reference), widely studied and used in several classic experiments that established basic GS biochemical properties (Shapiro and Stadtman 1968; Meek and Villafranca 1980; Stadtman 2001). The other one catalyzes the ATP-dependent γ-glutamylation of putrescine, a reaction similar to the traditional GS biosynthetic activity, but using the polyamine putrescine as substrate (Kurihara et al. 2005). Kurihara et al. (2005) also examined the other genes clustered with the one encoding the GS homolog, established their function as a novel putrescine utilization pathway, and proposed the gene name puu (from putrescine utilization). The GS homolog was denoted puuA; it is responsible for the first step in this catabolic pathway and was described as a γ-glutamyl-putrescine synthetase.
There are six sequences highlighted in purple in Figure 4b. These are encoded by the P. aeruginosa genome, and this species also relies on γ-glutamylation for polyamine catabolism (Yao and Lu 2011). Pseudomonas aeruginosa GSI-γ proteins are involved in a polyamine utilization pathway similar to the one described in E. coli, but expanded. Pseudomonas aeruginosa is able to metabolize a broader repertoire of polyamines, including diaminopropane, cadaverine, and spermidine, using a larger variety of γ-glutamyl-polyamine synthetase enzymes (Yao and Lu 2011).
Another characterized protein in this group belongs to M. tuberculosis (in orange in Fig. 4b), embedded in an Actinobacterial clade with sequences from six other species. MtuGlnA4 is a functional GS, but it is not an essential enzyme (Harth et al. 2005). Also, orthologs of the gene encoding MtuGlnA4 were lost or disrupted in other species of the genus, which indicates they are undergoing reductive evolution (Hayward et al. 2009). Nevertheless, the most important point to discuss regarding these GS homologs is their classification. The Mycobacterium tuberculosis genome encodes four functional GSs (Harth et al. 2005). On investigating GS evolution in mycobacteria, Hayward et al. (2009) classified three of them as Type II GSs. However, our results classified them as Type I sequences (Table 1; Fig. 3). Our results agree that GlnA1 is a GSI-β. Additionally, we present GlnA2 as a GSI-α, and GlnA4 as a GSI-γ. GlnA3 is not related to any of these well-established subgroups, but it clearly belongs to the Type I family. We evaluated the same four sequences in comparison with references from the three Types, so we believe this previous classification must be revised.
Although no other actinobacterial Gamma homologs in our dataset have been biochemically characterized, there is still evidence of their participation in polyamine catabolism. GS-like sequences from Streptomyces coelicolor have been demonstrated to catalyze the γ-glutamylation of cadaverine and putrescine (GlnA3, Krysenko et al. 2017) and ethanolamine (GlnA4, Krysenko et al. 2019). These studies predicted a general pathway for polyamine utilization in S. coelicolor (Krysenko et al. 2017), and demonstrated the presence of a very similar ethanolamine utilization gene cluster in Streptomyces albus (Krysenko et al. 2019). Hence, we believe the proteins from S. albus (Sal2640834740) and S. avermitilis (Sav637209896) evaluated here (Fig. 4b) probably display a similar function.
Polyamines are ubiquitous small polycations related to diverse biological functions in all organisms, including gene regulation, cell growth, proliferation, and differentiation (Bachrach 2010; Bae et al. 2018). Beyond these physiological functions, polyamines can be recycled by many bacteria or serve as carbon and nitrogen sources when released from cells into the environment. Hence, the proposed function of the proteins described here as GSI-γ (Kurihara et al. 2005; Yao and Lu 2011) is to enable polyamine utilization under nutrient starvation conditions. Other, uncharacterized sequences also belong to the Gamma clade. In this context, the biochemical and physiological functions discussed for the E. coli and P. aeruginosa sequences are probable functions to be investigated when characterizing these proteins.
OVERALL EVOLUTIONARY PICTURE
There has been a historical debate on whether the GSI and GSII types originated from diversification of one ancient gene, or from an ancient duplication originating two paralogous genes that further diversified (Kumada et al. 1993; Brown et al. 1994; Pesole et al. 1995). In Figure 5, we present two scenarios proposed for GS origins and evolution, based on these two hypotheses. GSIII, however, has been left out of the debate, so we included it in our representation.

Figure 5a presents the first scenario. It considers that LUCA possessed only one GS-encoding gene, and that the three genes (glnA, glnII, and glnN) diversified along with life diversification. In this scenario, GSI has been historically associated with Bacteria, whereas GSII has been historically associated with Eukarya. We cannot pinpoint the origin of GSIII, so we represent it in each domain with question marks. In this context, bacterial GSII was transferred from Eukaryotes in two distinct interdomain HGT events, whereas eukaryotic GSI was transferred from Bacteria. Figure 5b presents the second scenario, in which duplications during the period when the life domains originated gave rise to three GS-encoding genes. According to this hypothesis, LUCA already possessed the three GS types. The absence of GSII in Archaea would then be explained by an early gene loss in the archaeal lineage.
Interdomain HGT events and gene losses have definitely occurred, contributing to the puzzling phylogeny. A clear interdomain HGT occurred from Archaea to Firmicutes, which seem to have lost the bacterial GSI and retained the archaeal GSI-α. Firmicutes also lost the other GS types, with one exception: Clostridium beijerinckii still harbors GSIII. Another clear interdomain HGT took place from a basal eukaryotic group to Bacteria in the history of glnII, represented by the mixed GSII clade previously discussed. These two events are represented in both scenarios.
Nevertheless, we believe the preponderance of evidence favors the hypothesis that the GSI-α and GSI-β subgroups resulted from a duplication that occurred prior to LUCA. Kumada et al. (1993) predicted that GSI would be found in Eukaryotes, and such sequences were later found (Mathis et al. 2000; Wyatt et al. 2006). Our results extend the repertoire of Type I GS homologs encoded by eukaryotic genomes. These proteins probably remained unknown because they do not exert GS catalytic activity, but in many cases they have been shown to be expressed and new functions have been proposed for them. Some of them are still related to nitrogen homeostasis, whereas others appear to have acquired unexpected functions.
We also provided additional evidence that GSI-α and GSI-β are paralogs, pointing to a duplication in LUCA. GSII, although still not found in Archaea, also appears to have originated in LUCA, because we observe a clear split between bacterial and eukaryotic sequences. The GSIII family, despite the fewer examples found, displays well-conserved proteins across the three domains. The fact that basal eukaryotic groups harbor GSIII may be interpreted as evidence that it was present at eukaryotic origins and lost multiple times in different lineages. So, we believe the genes encoding the three GS families trace back to LUCA.
It is not disputed that GS history points to the origin of life. Weiss et al. (2016) traced 355 protein families back to LUCA and pictured that it accessed nitrogen via nitrogenase and GS, referring to glnA (GSI). In addition, we believe that the initial diversification of GS-encoding genes took place during the early origins of life, so that the three types were already present in LUCA. The presence of nitrogenase, among other features (Kim and Caetano-Anollés 2011; Weiss et al. 2016), favors the view of LUCA as a population with a rather complex metabolism and genome. For example, Adam et al. (2018) pictured a genome that encodes an enzymatic complex with at least four subunits similar to the current enzyme, the carbon monoxide dehydrogenase/acetyl-CoA synthase (CODH/ACS), and it is generally accepted that LUCA possessed a ribosome prototype with a basal translation system (Anantharaman et al. 2002; Koonin and Martin 2005; Fox 2010). In this context, translational GTPases have been present at the core of translation since before LUCA, and the evolutionary history of this superfamily indicates pre-LUCA diversification, with four homologous proteins already present (Atkinson 2015).
Pre-LUCA duplication events could have been rather common, reflecting a dynamic genome in the pre-LUCA/LUCA era. Labedan et al. (1999) presented a complex set of paralogous genes encoding carbamoyltransferases in LUCA, with successive duplication events originating four genes before the life domains split. Also, the HAD (haloacid dehalogenase) superfamily is believed to have diverged prior to LUCA, with at least five proteins already present (Burroughs et al. 2006). So, it seems plausible to picture a scenario in which a proto-GS-encoding gene existed early in the period when the life domains originated, around the time when biochemistry was invented and protein-coding genes were established, probably in the so-called RNA world (Koonin and Martin 2005). This ancient gene experienced rounds of duplication so that LUCA already displayed some GS diversity.
Additional studies are required to continue unveiling the diversity of GS-homologous proteins. Here, we expanded this overall picture of diversity and presented our data in light of the two main GS origin hypotheses. We favor the scenario of GS origins and initial diversification before the three domains split. It is important to characterize proteins from new species genetically and biochemically. However, it is imperative to evaluate these genes and proteins in the proper phylogenetic context. It is especially interesting to investigate organisms with two or three GS types/subtypes, because they can provide clues to understand this initial diversification and the surprising dynamics of gene evolution during the origin of the life domains and early life diversification.
AUTHOR CONTRIBUTIONS
GdCF conceptualized the idea of the study, performed visualization, and wrote the original draft. GdCF and ACT-Z performed formal analysis. GdCF, ACT-Z, and LMPP interpreted the data and reviewed and edited the manuscript. ACT-Z and LMPP provided resources. LMPP acquired funding.
ACKNOWLEDGMENTS
This work was supported by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) and Newton Fund (Brazil-UK collaboration).
DATA ARCHIVING
There are no data to be archived.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
LITERATURE CITED
Associate Editor: C. Burch
Handling Editor: A. McAdam