Volume 48, Issue s1 pp. 182s-183s

The Pneumocystis Genome Project: Update and Issues

MELANIE T. CUSHION

Department of Internal Medicine, University of Cincinnati College of Medicine, Cincinnati, OH 45267 and the Cincinnati VAMC, Cincinnati, OH 45220

Corresponding author: M.T. Cushion. Telephone: (513) 861–3100 X4417; E-mail: [email protected]

A. GEORGE SMULIAN

Department of Internal Medicine, University of Cincinnati College of Medicine, Cincinnati, OH 45267 and the Cincinnati VAMC, Cincinnati, OH 45220

First published: 11 July 2005

The international effort to create a Pneumocystis Genome Project was launched during the 5th International Workshop on Opportunistic Protists in Lille, France, 1997 [1,2]. There it was decided to use the most common rat Pneumocystis population, Pneumocystis carinii f. sp. carinii karyotype form 1, as the primary focus of the project. The genome was to be sequenced with the aid of a physical map constructed from cosmid libraries. A partial Expressed Sequence Tag (EST) database from mRNA of organisms harvested from fulminant infection was to be created in the first 2 years. The community would have access to sequences and reagents generated by the project.

In April of 1999, the Pneumocystis Genome Project was funded by an R01 mechanism from the National Institutes of Health for a period of 5 years. A consortium of international collaborators volunteered to participate in the Project. The funded participants included Dr. Melanie Cushion as Principal Investigator; Drs. George Smulian and James Stringer from the University of Cincinnati as Co-investigators; and, as subcontractors, Dr. Jonathan Arnold from the University of Georgia, Dr. Chao-Hung Lee from Indiana University, and Dr. Chuck Staben from the University of Kentucky. Drs. Yoshi Nakamura, Miki Wada and Hiroshi Mori from Tokyo University; Dr. C. Ben Beard from the CDC, Atlanta, Georgia; and Dr. Peter Philippsen, Basel, Switzerland, were to contribute to the genome sequencing effort on an ad hoc basis. A supplemental grant from the AIDS-FIRCA Program was awarded to Dr. Ann Wakefield and Dr. Cushion in September 2000 with the goal of cloning and sequencing the chromosome ends of Pneumocystis carinii f. sp. carinii form 1.

EST DATABASE

A cDNA library of Pneumocystis carinii f. sp. carinii karyotype form 1 organisms from a fulminant infection in a Long Evans rat was constructed in Uni-ZAP XR (Stratagene Inc.) by Dr. Smulian and used to generate most of the ESTs in the database. A small number of ESTs (~500) from a cDNA library constructed by Dr. Jeff Edman were included in the 5030 sequences analyzed. A pipeline to analyze and assemble the ESTs was constructed by Bradley Slaven, University of Cincinnati. The Pneumocystis genes in the EST sequences and bacterial or rat contaminants were identified at the nucleotide level using WU BLAST 2.0 (Warren Gish, Washington University, St. Louis, MO) and the nucleotide database available at http://www.ncbi.nlm.nih.gov. The rat and bacterial sequences with probability scores of p < 10⁻⁴⁰ were removed from the database and constituted 15% of the total. The Pneumocystis genes (13%) were assembled using 2 settings of PHRED/PHRAP and CAP3 (http://www.phrap.org, http://www.genome.cs.mtu.edu/sas.html). Visual examination of the sequence bins produced by each program run showed that CAP3 provided the best fit of the sequences with fewer anomalies, and it was used in subsequent analyses. Conservative assessment of the 85 contigs and 50 singletons revealed 33 single copy Pneumocystis genes and 58 multi-genes (total of 91) in the EST database. The non-Pneumocystis sequences were assembled into 622 contigs and 982 singletons. Using a threshold of p < 10⁻¹⁰, the contigs were further divided into 422 potential orthologs and 196 low hits, and the singletons into 368 potential orthologs and 600 low hits. A conservative estimate of the number of potential Pneumocystis genes in the EST database was calculated to be 1677. Most of the potential orthologs (p < 10⁻¹⁰) were fungal in origin (91%), and a majority of the fungal orthologs matching the Pneumocystis ESTs were genes/sequences of Schizosaccharomyces pombe (66%), followed by S. cerevisiae (19%).

A multitude of metabolic cycles and cellular functions were represented by the ESTs, including electron transport, glycolysis, signal transduction, mating, transcription, translation, replication, heme biosynthesis and sterol biosynthesis. Website interfaces with the capacity to search the ESTs can be found at http://www.biology.uky.edu/Pc and http://gene.genetics.uga.edu. A new site consolidating all the information will soon be available at http://www.pneumocystis.org.
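
The binning step of the EST pipeline can be illustrated with a minimal sketch. It is not the pipeline built for the project: it assumes the two probability cut-offs are 10⁻⁴⁰ (contaminant removal) and 10⁻¹⁰ (potential ortholog versus low hit), and the `BestHit` record, the source labels and the EST identifiers are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Assumed cut-offs: strong rat/bacterial hits are discarded, and the
# remaining hits are split into potential orthologs and low hits.
CONTAMINANT_CUTOFF = 1e-40
ORTHOLOG_CUTOFF = 1e-10
CONTAMINANT_SOURCES = {"rat", "bacteria"}


@dataclass(frozen=True)
class BestHit:
    """Hypothetical record holding the best database hit for one EST."""
    est_id: str
    source: str      # organism group of the best hit, e.g. "rat" or "fungus"
    p_value: float   # probability score reported for that hit


def classify(hit: BestHit) -> str:
    """Assign one EST to a bin based on its best hit."""
    if hit.source in CONTAMINANT_SOURCES and hit.p_value < CONTAMINANT_CUTOFF:
        return "contaminant"
    if hit.p_value < ORTHOLOG_CUTOFF:
        return "potential ortholog"
    return "low hit"


def bin_hits(hits: Iterable[BestHit]) -> Dict[str, List[str]]:
    """Group EST identifiers by bin, preserving input order."""
    bins: Dict[str, List[str]] = {
        "contaminant": [],
        "potential ortholog": [],
        "low hit": [],
    }
    for hit in hits:
        bins[classify(hit)].append(hit.est_id)
    return bins


# Made-up records, not data from the project.
example_hits = [
    BestHit("est_001", "rat", 1e-60),
    BestHit("est_002", "fungus", 1e-25),
    BestHit("est_003", "fungus", 1e-4),
]
bins = bin_hits(example_hits)
for name, members in bins.items():
    print(name, members)
```

Keeping the cut-offs as named constants makes it easy to rerun the binning under a different reading of the thresholds.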

PHYSICAL MAP

A physical map for the 7.7 Mbp genome of P. carinii f. sp. carinii is being constructed using a dual strategy: mapping by sequencing, and hybridization of cosmids to the Uni-ZAP cDNA library. About 3,000 ends of the pWEB and Lorist 6Xh cosmid libraries constructed for the mapping project have been sequenced. These ends will be used with the sequences from about 120 cosmids generated by shotgun sequencing to facilitate the mapping by sequencing strategy. About 200 cosmids were hybridized to the Uni-ZAP cDNA library to link together cosmids hybridizing to the same cDNAs. Chromosomes 3, 7, and 15 have been partially constructed using this combination of strategies. A new tool for viewing the map is under construction at the University of Georgia and will be available at http://gene.genetics.uga.edu.
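
The idea of linking cosmids through shared cDNA hybridization can be sketched as a connected-components problem: two cosmids belong to the same group if they hybridize to a common cDNA, directly or through a chain of other cosmids. The sketch below is only an illustration of that idea, not the mapping software used by the project; the function name and the cosmid and cDNA labels in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def link_cosmids(hybridizations: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group cosmids that share at least one cDNA, directly or transitively.

    `hybridizations` maps each cosmid name to the set of cDNAs it hybridized to.
    """
    # Invert the mapping so each cDNA points at the cosmids that hit it.
    cosmids_by_cdna: Dict[str, Set[str]] = defaultdict(set)
    for cosmid, cdnas in hybridizations.items():
        for cdna in cdnas:
            cosmids_by_cdna[cdna].add(cosmid)

    groups: List[Set[str]] = []
    unvisited = set(hybridizations)
    while unvisited:
        start = unvisited.pop()
        group = {start}
        stack = [start]
        # Depth-first walk over cosmids connected through shared cDNAs.
        while stack:
            current = stack.pop()
            for cdna in hybridizations[current]:
                for neighbour in cosmids_by_cdna[cdna]:
                    if neighbour not in group:
                        group.add(neighbour)
                        unvisited.discard(neighbour)
                        stack.append(neighbour)
        groups.append(group)
    # Sort for a deterministic order of groups.
    return sorted(groups, key=lambda g: sorted(g))


# Made-up example: c1, c2 and c3 are chained through cDNAs a and b.
example = {
    "c1": {"a"},
    "c2": {"a", "b"},
    "c3": {"b"},
    "c4": {"z"},
}
linked = link_cosmids(example)
print(linked)
```

Each resulting group is a candidate set of overlapping or neighbouring cosmids; ordering the cosmids within a group would still require the end sequences or other mapping data.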

GENOMIC ORGANIZATION

The pWEB cosmid 15A6 was sequenced, gaps closed by PCR, and assembled by Dr. George Smulian using CAP3 [3]. This represents the longest continuous genomic sequence from Pneumocystis to date and provides insights into its genomic organization. The cosmid DNA sequence was localized to chromosome 1 (the largest) of the form 1 chromosome profile, ∼700 kb. Fifteen genes were identified in the 32,083-bp insert: 6 on the Watson strand and 9 on the Crick strand. Genes contained 1 to 7 introns, averaging 31 to 235 bases in length. The gene density was calculated to be 1 gene/2139 bases, for an estimated 3740 genes in the genome, based on an 8 Mbp genome size. Two cosmids, 3G5 and 1B2, containing members of the msg-, prt- and msr- gene families and sub-telomeric and telomeric sequences, were sequenced by the Sanger Institute (http://www.sanger.ac.uk) and analyzed in collaboration with Drs. Wakefield, Stringer, Smulian, Keely, and Cushion. The array structures in each cosmid were distinct from one another and are described in detail in these Proceedings [4,5]. Of note, cosmid 1B2 contained 2 prt genes, the first example of an array with more than 1 protease gene. Cosmids with arrays shared with 1B2 were identified in the pWEB library and analyzed by restriction enzyme digests. Two of these 3 cosmid inserts mapped to band 9 on a P. carinii f. sp. carinii form 1 profile, while the third mapped to band 4. These data suggest that the same array can be found at the ends of 2 different chromosomes and that these arrays, or large portions of them, may move within the genome. The strategy to sequence the genome of P. carinii f. sp. carinii has been changed from sequencing a cosmid minimal tile to sequencing cosmid contigs by preparation of shotgun libraries from sets of cosmids found to be linked by physical mapping strategies (described above), with gap filling by PCR. An approximate 1X coverage of the genome has been completed to date.
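
The gene density and genome-wide gene estimate quoted for cosmid 15A6 can be reproduced with simple arithmetic. The short sketch below uses only the figures given in the text (a 32,083-bp insert, 15 genes, an 8 Mbp genome) and assumes that both values were rounded to the nearest whole number; the function names are illustrative.

```python
# Figures quoted in the text.
INSERT_LENGTH_BP = 32_083    # length of the cosmid 15A6 insert
GENES_IN_INSERT = 15         # genes identified in that insert
GENOME_SIZE_BP = 8_000_000   # 8 Mbp genome size used for the estimate


def bases_per_gene(insert_length_bp: int, gene_count: int) -> int:
    """Average number of bases per gene, rounded to the nearest base."""
    return round(insert_length_bp / gene_count)


def estimated_gene_count(genome_size_bp: int, density_bp_per_gene: int) -> int:
    """Extrapolate the insert's gene density to the whole genome."""
    return round(genome_size_bp / density_bp_per_gene)


density = bases_per_gene(INSERT_LENGTH_BP, GENES_IN_INSERT)
total_genes = estimated_gene_count(GENOME_SIZE_BP, density)
print(f"1 gene per {density} bases; about {total_genes} genes genome-wide")
```

The extrapolation treats a single 32-kb insert as representative of the whole genome, so the estimate is sensitive to how gene-rich that region happens to be.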

MITOCHONDRIAL GENOME

The mitochondrial genome of P. carinii f. sp. carinii has been estimated to be ~40–50 kb based on migration through pulsed field gels [6]. Sequencing of the ends of the pWEB cosmids identified sequence homologous with fungal mitochondrial genes. One cosmid, 12c7 (~12,000 bp), was sequenced and partially assembled using CAP3. Three contigs of 3194, 8146 and 988 bp were constructed and analyzed by WU BLAST against nucleotide and protein databases. Genes identified in the P. carinii f. sp. carinii mitochondrial genome included cytochrome B, NADH dehydrogenase subunits 4 and 5, cytochrome oxidase 3, large subunit rRNA, and ATPase subunits 6 and 9. These genes, and those previously identified in GenBank, suggest that the mitochondrial genome of P. carinii f. sp. carinii resembles most fungal mitochondrial genomes except that of Schizosaccharomyces pombe, which is much smaller at 19 kb and lacks the NADH genes. The migration of the mitochondrial genome on CHEF gels may be suggestive of a linear rather than circular configuration.

AVAILABLE REAGENTS AND INFORMATICS

The Uni-ZAP cDNA library constructed for the Genome Project and used to generate the EST database is available from the American Type Culture Collection using the same naming convention as in the databases (http://www.ATCC.org). The pWEB cosmid library has not yet been deposited, but clones are available upon request: [email protected]. The websites at which the genetic data are available are listed above. In addition, a new website consolidating genetic information, links to other useful sites, and general information will be available by 2002 at http://www.Pneumocystis.org. The Pneumocystis genes/proteins are being curated by Proteome, Inc./Invitrogen, Inc. and are included in a “MycoPathPD” database containing like data from other pathogenic fungi. An interactive catalogue of genes and proteins by sequence and function will be available for academic and corporate usage at http://www.proteome.com. Registration is required. Curation of the genes and proteins of Pneumocystis has been made difficult by: 1) poor identification of the host origin of the Pneumocystis when submitting sequences to public databases; and 2) inconsistent gene nomenclature. To address these issues, we have formulated conventions for gene nomenclature. The issue of the origin of the organism will be somewhat addressed by the movement towards speciation discussed at the 7th IWOP and contained in these Proceedings [7]. However, if an investigator is in doubt when submitting a sequence, simply use the name of the host of origin.

NAMING CONVENTIONS

As additional gene sequence becomes available from both the Pneumocystis Genome Project and the sequencing efforts of individual laboratories, a standardized naming convention will greatly facilitate annotation and curation efforts. Putative genes identified by the genome sequencing effort will be assigned an interim identifier based on their genomic location, following a schema similar to that used by the Saccharomyces cerevisiae genome sequencing effort. The identifier will follow the convention PCCA001, PCCA002, PCCA003…, where PCC identifies the organism of origin as Pneumocystis carinii f. sp. carinii, A indicates its location on chromosome 1, and 001, 002, 003… indicate adjacent genes within a contiguous sequence (chromosome, contig or cosmid). In addition to the identifier, individual genes may be assigned gene names using the convention proposed by Stringer and Cushion [6]. Association of the gene identifier (e.g. PCCA001), the gene name (e.g. gsh1), and the ESTs within the database (e.g. s35f9) will be provided at the new Pneumocystis website (http://www.pneumocystis.org). As proposed [6], genes should be assigned a three letter, lower case, italics name (xyz); the putative gene product should be designated by a three roman letter name with the initial letter capitalized (Xyz). Genes should be assigned the name of the S. cerevisiae ortholog (where gene function is often best characterized) with the understanding that the function of the gene product in P. carinii may ultimately be shown to be distinct from that of its homolog. Gene families will be identified by a three roman letter name capitalized (e.g. MSG family), with individual genes designated msg1, msg2, msg3… and the predicted proteins Msg1, Msg2, Msg3…. Adherence to a standardized naming convention will greatly assist later efforts to correlate all available data and to assemble data within databases such as “MycoPathPD” and the Pneumocystis database at http://www.pneumocystis.org.
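
The naming conventions above lend themselves to a small helper for building and checking names. The following is a minimal sketch under stated assumptions, not an official tool of the project: the text only states that the letter A denotes the first chromosome, so mapping later chromosomes to B, C, … is an assumption, as are the three-digit zero padding beyond the quoted examples and all function names.

```python
import re
from typing import NamedTuple

ORGANISM_PREFIX = "PCC"  # Pneumocystis carinii f. sp. carinii


class GeneIdentifier(NamedTuple):
    organism: str
    chromosome: int  # 1-based; assumed to map to letters A, B, C, ...
    position: int    # order of the gene within the contiguous sequence


_IDENTIFIER_RE = re.compile(r"^(PCC)([A-Z])(\d{3})$")
_GENE_NAME_RE = re.compile(r"^[a-z]{3}\d*$")


def make_identifier(chromosome: int, position: int) -> str:
    """Build an interim identifier such as PCCA001."""
    if not 1 <= chromosome <= 26:
        raise ValueError("chromosome must map to a single letter A-Z")
    if not 1 <= position <= 999:
        raise ValueError("position must fit in three digits")
    letter = chr(ord("A") + chromosome - 1)
    return f"{ORGANISM_PREFIX}{letter}{position:03d}"


def parse_identifier(identifier: str) -> GeneIdentifier:
    """Split an interim identifier into its parts, or raise ValueError."""
    match = _IDENTIFIER_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a valid interim identifier: {identifier!r}")
    organism, letter, digits = match.groups()
    return GeneIdentifier(organism, ord(letter) - ord("A") + 1, int(digits))


def protein_name(gene_name: str) -> str:
    """Gene name (e.g. msg1) to predicted protein name (e.g. Msg1)."""
    if _GENE_NAME_RE.match(gene_name) is None:
        raise ValueError(f"not a valid gene name: {gene_name!r}")
    return gene_name[0].upper() + gene_name[1:]


def family_name(gene_name: str) -> str:
    """Gene name (e.g. msg2) to gene family name (e.g. MSG)."""
    if _GENE_NAME_RE.match(gene_name) is None:
        raise ValueError(f"not a valid gene name: {gene_name!r}")
    return gene_name[:3].upper()


print(make_identifier(1, 1))
print(parse_identifier("PCCA003"))
print(protein_name("msg1"), family_name("msg2"))
```

Italic formatting of gene names is a typographic matter and is left to whatever system renders the names.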

ACKNOWLEDGMENTS

The authors would personally like to acknowledge the financial support of the corporate sponsors and the Burroughs Wellcome Fund, as well as the efforts of the Workshop organizer, Edna Kaneshiro.