Around 150 million years ago, eusocial termites evolved from within the cockroaches, 50 million years before eusocial Hymenoptera, such as bees and ants, appeared. Here, we report the 2-Gb genome of the German cockroach, Blattella germanica, and the 1.3-Gb genome of the drywood termite Cryptotermes secundus. We show evolutionary signatures of termite eusociality by comparing the genomes and transcriptomes of three termites and the cockroach against the background of 16 other eusocial and non-eusocial insects. Dramatic adaptive changes in genes underlying the production and perception of pheromones confirm the importance of chemical communication in the termites. These are accompanied by major changes in gene regulation and the molecular evolution of caste determination. Many of these results parallel molecular mechanisms of eusocial evolution in Hymenoptera. However, the specific solutions are remarkably different, thus revealing a striking case of convergence in one of the major evolutionary transitions in biological complexity.
BACKGROUND: Helicoverpa armigera and Helicoverpa zea are major caterpillar pests of Old and New World agriculture, respectively. Both, particularly H. armigera, are extremely polyphagous, and H. armigera has developed resistance to many insecticides. Here we use comparative genomics, transcriptomics and resequencing to elucidate the genetic basis for their properties as pests. RESULTS: We find that, prior to their divergence about 1.5 Mya, the H. armigera/H. zea lineage had accumulated up to more than 100 more members of specific detoxification and digestion gene families and more than 100 extra gustatory receptor genes, compared to other lepidopterans with narrower host ranges. The two genomes remain very similar in gene content and order, but H. armigera is more polymorphic overall, and H. zea has lost several detoxification genes, as well as about 50 gustatory receptor genes. It also lacks certain genes and alleles conferring insecticide resistance found in H. armigera. Non-synonymous sites in the expanded gene families above are rapidly diverging, both between paralogues and between orthologues in the two species. Whole genome transcriptomic analyses of H. armigera larvae show widely divergent responses to different host plants, including responses among many of the duplicated detoxification and digestion genes. CONCLUSIONS: The extreme polyphagy of the two heliothines is associated with extensive amplification and neofunctionalisation of genes involved in host finding and use, coupled with versatile transcriptional responses on different hosts. H. armigera's invasion of the Americas in recent years means that hybridisation could generate populations that are both locally adapted and insecticide resistant.
BACKGROUND: The Mediterranean fruit fly (medfly), Ceratitis capitata, is a major destructive insect pest due to its broad host range, which includes hundreds of fruits and vegetables. It exhibits a unique ability to invade and adapt to ecological niches throughout tropical and subtropical regions of the world, though medfly infestations have been prevented and controlled by the sterile insect technique (SIT) as part of integrated pest management programs (IPMs). The genetic analysis and manipulation of medfly has been subject to intensive study in an effort to improve SIT efficacy and other aspects of IPM control. RESULTS: The 479 Mb medfly genome is sequenced from adult flies from lines inbred for 20 generations. A high-quality assembly is achieved having a contig N50 of 45.7 kb and scaffold N50 of 4.06 Mb. In-depth curation of more than 1800 messenger RNAs shows specific gene expansions that can be related to invasiveness and host adaptation, including gene families for chemoreception, toxin and insecticide metabolism, cuticle proteins, opsins, and aquaporins. We identify genes relevant to IPM control, including those required to improve SIT. CONCLUSIONS: The medfly genome sequence provides critical insights into the biology of one of the most serious and widespread agricultural pests. This knowledge should significantly advance the means of controlling the size and invasive potential of medfly populations. Its close relationship to Drosophila, and other insect species important to agriculture and human health, will further comparative functional and structural studies of insect genomes that should broaden our understanding of gene family evolution.
Lucilia cuprina is a parasitic fly of major economic importance worldwide. Larvae of this fly invade their animal host, feed on tissues and excretions and progressively cause severe skin disease (myiasis). Here we report the sequence and annotation of the 458-megabase draft genome of Lucilia cuprina. Analyses of this genome and the 14,544 predicted protein-encoding genes provide unique insights into the fly's molecular biology, interactions with the host animal and insecticide resistance. These insights have broad implications for designing new methods for the prevention and control of myiasis.
BACKGROUND: The shift from solitary to social behavior is one of the major evolutionary transitions. Primitively eusocial bumblebees are uniquely placed to illuminate the evolution of highly eusocial insect societies. Bumblebees are also invaluable natural and agricultural pollinators, and there is widespread concern over recent population declines in some species. High-quality genomic data will inform key aspects of bumblebee biology, including susceptibility to implicated population viability threats. RESULTS: We report the high quality draft genome sequences of Bombus terrestris and Bombus impatiens, two ecologically dominant bumblebees and widely utilized study species. Comparing these new genomes to those of the highly eusocial honeybee Apis mellifera and other Hymenoptera, we identify deeply conserved similarities, as well as novelties key to the biology of these organisms. Some honeybee genome features thought to underpin advanced eusociality are also present in bumblebees, indicating an earlier evolution in the bee lineage. Xenobiotic detoxification and immune genes are similarly depauperate in bumblebees and honeybees, and multiple categories of genes linked to social organization, including development and behavior, show high conservation. Key differences identified include a bias in bumblebee chemoreception towards gustation from olfaction, and striking differences in microRNAs, potentially responsible for gene regulation underlying social and other traits. CONCLUSIONS: These two bumblebee genomes provide a foundation for post-genomic research on these key pollinators and insect societies. Overall, gene repertoires suggest that the route to advanced eusociality in bees was mediated by many small changes in many genes and processes, and not by notable expansion or depauperation.
BACKGROUND: The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes. RESULTS: Here, we report an improved honey bee genome assembly (Amel_4.5) with a new gene annotation set (OGSv3.2), and show that the honey bee genome contains a number of genes similar to that of other insect genomes, contrary to what was suggested in OGSv1.0. The new genome assembly is more contiguous and complete and the new gene set includes ~5000 more protein-coding genes, 50% more than previously reported. About 1/6 of the additional genes were due to improvements to the assembly, and the remaining were inferred based on new RNAseq and protein data. CONCLUSIONS: Lessons learned from this genome upgrade have important implications for future genome sequencing projects. Furthermore, the improvements significantly enhance genomic resources for the honey bee, a key model for social behavior and essential to global ecology through pollination.
PURPOSE: Copy-number variations as a mutational mechanism contribute significantly to human disease. Approximately one-half of the patients with Charcot-Marie-Tooth (CMT) disease have a 1.4 Mb duplication copy-number variation as the cause of their neuropathy. However, non-CMT1A neuropathy patients rarely have causative copy-number variations, and to date, autosomal-recessive disease has not been associated with copy-number variation as a mutational mechanism. METHODS: We performed Agilent 8 x 60 K array comparative genomic hybridization on DNA from 12 recessive Turkish families with CMT disease. Additional molecular studies were conducted to detect breakpoint junctions and to evaluate gene expression levels in a family in which we detected an intragenic duplication copy-number variation. RESULTS: We detected an ~6.25 kb homozygous intragenic duplication in NDRG1, a gene known to be causative for recessive HMSNL/CMT4D, in three individuals from a Turkish family with CMT neuropathy. Further studies showed that this intragenic copy-number variation resulted in a homozygous duplication of exons 6-8 that caused decreased mRNA expression of NDRG1. CONCLUSION: Exon-focused high-resolution array comparative genomic hybridization enables the detection of copy-number variation carrier states in recessive genes, particularly small copy-number variations encompassing or disrupting single genes. In families for whom a molecular diagnosis has not been elucidated by conventional clinical assays, an assessment for copy-number variations in known CMT genes might be considered.
BACKGROUND: The yaws treponemes, Treponema pallidum ssp. pertenue (TPE) strains, are closely related to syphilis causing strains of Treponema pallidum ssp. pallidum (TPA). Both yaws and syphilis are distinguished on the basis of epidemiological characteristics, clinical symptoms, and several genetic signatures of the corresponding causative agents. METHODOLOGY/PRINCIPAL FINDINGS: To precisely define genetic differences between TPA and TPE, high-quality whole genome sequences of three TPE strains (Samoa D, CDC-2, Gauthier) were determined using next-generation sequencing techniques. TPE genome sequences were compared to four genomes of TPA strains (Nichols, DAL-1, SS14, Chicago). The genome structure was identical in all three TPE strains with similar length ranging between 1,139,330 bp and 1,139,744 bp. No major genome rearrangements were found when compared to the four TPA genomes. The whole genome nucleotide divergence (d(A)) between TPA and TPE subspecies was 4.7 and 4.8 times higher than the observed nucleotide diversity (pi) among TPA and TPE strains, respectively, corresponding to 99.8% identity between TPA and TPE genomes. A set of 97 (9.9%) TPE genes encoded proteins containing two or more amino acid replacements or other major sequence changes. The TPE divergent genes were mostly from the group encoding potential virulence factors and genes encoding proteins with unknown function. CONCLUSIONS/SIGNIFICANCE: Hypothetical genes, with genetic differences, consistently found between TPE and TPA strains are candidates for syphilitic treponemes virulence factors. Seventeen TPE genes were predicted under positive selection, and eleven of them coded either for predicted exported proteins or membrane proteins suggesting their possible association with the cell surface. Sequence changes between TPE and TPA strains and changes specific to individual strains represent suitable targets for subspecies- and strain-specific molecular diagnostics.
Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to "phase 3 finished" status using time-consuming and expensive Sanger-based manual finishing processes. For more facile assembly and automated finishing of draft genomes, we present here an automated approach to finishing using long-reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly, (publicly available at https://sourceforge.net/projects/pb-jelly/) automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides "lift-over" co-ordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the Sooty mangabey. With 24x mapped coverage of PacBio long-reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4x mapped coverage of PacBio long-reads we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8x mapped coverage of mangabey PacBio long-reads we addressed 97% of gaps and closed 66% of addressed gaps and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.
The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering approximately 4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for approximately 60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease.
Treponema paraluiscuniculi is the causative agent of rabbit venereal spirochetosis. It is not infectious to humans, although its genome structure is very closely related to other pathogenic Treponema species including Treponema pallidum subspecies pallidum, the etiological agent of syphilis. In this study, the genome sequence of Treponema paraluiscuniculi, strain Cuniculi A, was determined by a combination of several high-throughput sequencing strategies. Whereas the overall size (1,133,390 bp), arrangement, and gene content of the Cuniculi A genome closely resembled those of the T. pallidum genome, the T. paraluiscuniculi genome contained a markedly higher number of pseudogenes and gene fragments (51). In addition to pseudogenes, 33 divergent genes were also found in the T. paraluiscuniculi genome. A set of 32 (out of 84) affected genes encoded proteins of known or predicted function in the Nichols genome. These proteins included virulence factors, gene regulators and components of DNA repair and recombination. The majority (52 or 61.9%) of the Cuniculi A pseudogenes and divergent genes were of unknown function. Our results indicate that T. paraluiscuniculi has evolved from a T. pallidum-like ancestor and adapted to a specialized host-associated niche (rabbits) during loss of infectivity to humans. The genes that are inactivated or altered in T. paraluiscuniculi are candidates for virulence factors important in the infectivity and pathogenesis of T. pallidum subspecies.
The human microbiome refers to the community of microorganisms, including prokaryotes, viruses, and microbial eukaryotes, that populate the human body. The National Institutes of Health launched an initiative that focuses on describing the diversity of microbial species that are associated with health and disease. The first phase of this initiative includes the sequencing of hundreds of microbial reference genomes, coupled to metagenomic sequencing from multiple body sites. Here we present results from an initial reference genome sequencing of 178 microbial genomes. From 547,968 predicted polypeptides that correspond to the gene complement of these strains, previously unidentified ("novel") polypeptides that had both unmasked sequence length greater than 100 amino acids and no BLASTP match to any nonreference entry in the nonredundant subset were defined. This analysis resulted in a set of 30,867 polypeptides, of which 29,987 (approximately 97%) were unique. In addition, this set of microbial genomes allows for approximately 40% of random sequences from the microbiome of the gastrointestinal tract to be associated with organisms based on the match criteria used. Insights into pan-genome analysis suggest that we are still far from saturating microbial species genetic data sets. In addition, the associated metrics and standards used by our group for quality assurance are presented.
We report here genome sequences and comparative analyses of three closely related parasitoid wasps: Nasonia vitripennis, N. giraulti, and N. longicornis. Parasitoids are important regulators of arthropod populations, including major agricultural pests and disease vectors, and Nasonia is an emerging genetic model, particularly for evolutionary and developmental genetics. Key findings include the identification of a functional DNA methylation tool kit; hymenopteran-specific genes including diverse venoms; lateral gene transfers among Pox viruses, Wolbachia, and Nasonia; and the rapid evolution of genes involved in nuclear-mitochondrial interactions that are implicated in speciation. Newly developed genome resources advance Nasonia for genetic research, accelerate mapping and cloning of quantitative trait loci, and will ultimately provide tools and knowledge for further increasing the utility of parasitoids as pest insect-control agents.
BACKGROUND: Gardnerella vaginalis is described as a common vaginal bacterial species whose presence correlates strongly with bacterial vaginosis (BV). Here we report the genome sequencing and comparative analyses of three strains of G. vaginalis. Strains 317 (ATCC 14019) and 594 (ATCC 14018) were isolated from the vaginal tracts of women with symptomatic BV, while Strain 409-05 was isolated from a healthy, asymptomatic individual with a Nugent score of 9. PRINCIPAL FINDINGS: Substantial genomic rearrangement and heterogeneity were observed that appeared to have resulted from both mobile elements and substantial lateral gene transfer. These genomic differences translated to differences in metabolic potential. All strains are equipped with significant virulence potential, including genes encoding the previously described vaginolysin, pili for cytoadhesion, EPS biosynthetic genes for biofilm formation, and antimicrobial resistance systems, We also observed systems promoting multi-drug and lantibiotic extrusion. All G. vaginalis strains possess a large number of genes that may enhance their ability to compete with and exclude other vaginal colonists. These include up to six toxin-antitoxin systems and up to nine additional antitoxins lacking cognate toxins, several of which are clustered within each genome. All strains encode bacteriocidal toxins, including two lysozyme-like toxins produced uniquely by strain 409-05. Interestingly, the BV isolates encode numerous proteins not found in strain 409-05 that likely increase their pathogenic potential. These include enzymes enabling mucin degradation, a trait previously described to strongly correlate with BV, although commonly attributed to non-G. vaginalis species. CONCLUSIONS: Collectively, our results indicate that all three strains are able to thrive in vaginal environments, and therein the BV isolates are capable of occupying a niche that is unique from 409-05. Each strain has significant virulence potential, although genomic and metabolic differences, such as the ability to degrade mucin, indicate that the detection of G. vaginalis in the vaginal tract provides only partial information on the physiological potential of the organism.
To understand the biology and evolution of ruminants, the cattle genome was sequenced to about sevenfold coverage. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species of which 1217 are absent or undetected in noneutherian (marsupial or monotreme) genomes. Cattle-specific evolutionary breakpoint regions in chromosomes have a higher density of segmental duplications, enrichment of repetitive elements, and species-specific variations in genes associated with lactation and immune responsiveness. Genes involved in metabolism are generally highly conserved, although five metabolic genes are deleted or extensively diverged from their human orthologs. The cattle genome sequence thus provides a resource for understanding mammalian evolution and accelerating livestock genetic improvement for milk and meat production.
BACKGROUND: Enterococcus faecalis has emerged as a major hospital pathogen. To explore its diversity, we sequenced E. faecalis strain OG1RF, which is commonly used for molecular manipulation and virulence studies. RESULTS: The 2,739,625 base pair chromosome of OG1RF was found to contain approximately 232 kilobases unique to this strain compared to V583, the only publicly available sequenced strain. Almost no mobile genetic elements were found in OG1RF. The 64 areas of divergence were classified into three categories. First, OG1RF carries 39 unique regions, including 2 CRISPR loci and a new WxL locus. Second, we found nine replacements where a sequence specific to V583 was substituted by a sequence specific to OG1RF. For example, the iol operon of OG1RF replaces a possible prophage and the vanB transposon in V583. Finally, we found 16 regions that were present in V583 but missing from OG1RF, including the proposed pathogenicity island, several probable prophages, and the cpsCDEFGHIJK capsular polysaccharide operon. OG1RF was more rapidly but less frequently lethal than V583 in the mouse peritonitis model and considerably outcompeted V583 in a murine model of urinary tract infections. CONCLUSION: E. faecalis OG1RF carries a number of unique loci compared to V583, but the almost complete lack of mobile genetic elements demonstrates that this is not a defining feature of the species. Additionally, OG1RF's effects in experimental models suggest that mediators of virulence may be diverse between different E. faecalis strains and that virulence is not dependent on the presence of mobile genetic elements.
Escherichia coli DH10B was designed for the propagation of large insert DNA library clones. It is used extensively, taking advantage of properties such as high DNA transformation efficiency and maintenance of large plasmids. The strain was constructed by serial genetic recombination steps, but the underlying sequence changes remained unverified. We report the complete genomic sequence of DH10B by using reads accumulated from the bovine sequencing project at Baylor College of Medicine and assembled with DNAStar's SeqMan genome assembler. The DH10B genome is largely colinear with that of the wild-type K-12 strain MG1655, although it is substantially more complex than previously appreciated, allowing DH10B biology to be further explored. The 226 mutated genes in DH10B relative to MG1655 are mostly attributable to the extensive genetic manipulations the strain has undergone. However, we demonstrate that DH10B has a 13.5-fold higher mutation rate than MG1655, resulting from a dramatic increase in insertion sequence (IS) transposition, especially IS150. IS elements appear to have remodeled genome architecture, providing homologous recombination sites for a 113,260-bp tandem duplication and an inversion. DH10B requires leucine for growth on minimal medium due to the deletion of leuLABCD and harbors both the relA1 and spoT1 alleles causing both sensitivity to nutritional downshifts and slightly lower growth rates relative to the wild type. Finally, while the sequence confirms most of the reported alleles, the sequence of deoR is wild type, necessitating reexamination of the assumed basis for the high transformability of DH10B.
Tribolium castaneum is a member of the most species-rich eukaryotic order, a powerful model organism for the study of generalized insect development, and an important pest of stored agricultural products. We describe its genome sequence here. This omnivorous beetle has evolved the ability to interact with a diverse chemical environment, as shown by large expansions in odorant and gustatory receptors, as well as P450 and other detoxification enzymes. Development in Tribolium is more representative of other insects than is Drosophila, a fact reflected in gene content and function. For example, Tribolium has retained more ancestral genes involved in cell-cell communication than Drosophila, some being expressed in the growth zone crucial for axial elongation in short-germ development. Systemic RNA interference in T. castaneum functions differently from that in Caenorhabditis elegans, but nevertheless offers similar power for the elucidation of gene function and identification of targets for selective insect control.
The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species.
After the completion of a draft human genome sequence, the International Human Genome Sequencing Consortium has proceeded to finish and annotate each of the 24 chromosomes comprising the human genome. Here we describe the sequencing and analysis of human chromosome 3, one of the largest human chromosomes. Chromosome 3 comprises just four contigs, one of which currently represents the longest unbroken stretch of finished DNA sequence known so far. The chromosome is remarkable in having the lowest rate of segmental duplication in the genome. It also includes a chemokine receptor gene cluster as well as numerous loci involved in multiple human cancers such as the gene encoding FHIT, which contains the most common constitutive fragile site in the genome, FRA3B. Using genomic sequence from chimpanzee and rhesus macaque, we were able to characterize the breakpoints defining a large pericentric inversion that occurred some time after the split of Homininae from Ponginae, and propose an evolutionary history of the inversion.
We have sequenced the genome of a second Drosophila species, Drosophila pseudoobscura, and compared this to the genome sequence of Drosophila melanogaster, a primary model organism. Throughout evolution the vast majority of Drosophila genes have remained on the same chromosome arm, but within each arm gene order has been extensively reshuffled, leading to a minimum of 921 syntenic blocks shared between the species. A repetitive sequence is found in the D. pseudoobscura genome at many junctions between adjacent syntenic blocks. Analysis of this novel repetitive element family suggests that recombination between offset elements may have given rise to many paracentric inversions, thereby contributing to the shuffling of gene order in the D. pseudoobscura lineage. Based on sequence similarity and synteny, 10,516 putative orthologs have been identified as a core gene set conserved over 25-55 million years (Myr) since the pseudoobscura/melanogaster divergence. Genes expressed in the testes had higher amino acid sequence divergence than the genome-wide average, consistent with the rapid evolution of sex-specific proteins. Cis-regulatory sequences are more conserved than random and nearby sequences between the species--but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible. Overall, a pattern of repeat-mediated chromosomal rearrangement, and high coadaptation of both male genes and cis-regulatory sequences emerges as important themes of genome divergence between these species of Drosophila.
The human X chromosome has a unique biology that was shaped by its evolution as the sex chromosome shared by males and females. We have determined 99.3% of the euchromatic sequence of the X chromosome. Our analysis illustrates the autosomal origin of the mammalian sex chromosomes, the stepwise process that led to the progressive loss of recombination between X and Y, and the extent of subsequent degradation of the Y chromosome. LINE1 repeat elements cover one-third of the X chromosome, with a distribution that is consistent with their proposed role as way stations in the process of X-chromosome inactivation. We found 1,098 genes in the sequence, of which 99 encode proteins expressed in testis and in various tumour types. A disproportionately high number of mendelian diseases are documented for the X chromosome. Of this number, 168 have been explained by mutations in 113 X-linked genes, which in many cases were characterized with the aid of the DNA sequence.
The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5'-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is now required to complete the mouse and human collections. Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.
The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.
Rickettsia typhi, the causative agent of murine typhus, is an obligate intracellular bacterium with a life cycle involving both vertebrate and invertebrate hosts. Here we present the complete genome sequence of R. typhi (1,111,496 bp) and compare it to the two published rickettsial genome sequences: R. prowazekii and R. conorii. We identified 877 genes in R. typhi encoding 3 rRNAs, 33 tRNAs, 3 noncoding RNAs, and 838 proteins, 3 of which are frameshifts. In addition, we discovered more than 40 pseudogenes, including the entire cytochrome c oxidase system. The three rickettsial genomes share 775 genes: 23 are found only in R. prowazekii and R. typhi, 15 are found only in R. conorii and R. typhi, and 24 are unique to R. typhi. Although most of the genes are colinear, there is a 35-kb inversion in gene order, which is close to the replication terminus, in R. typhi, compared to R. prowazekii and R. conorii. In addition, we found a 124-kb R. typhi-specific inversion, starting 19 kb from the origin of replication, compared to R. prowazekii and R. conorii. Inversions in this region are also seen in the unpublished genome sequences of R. sibirica and R. rickettsii, indicating that this region is a hot spot for rearrangements. Genome comparisons also revealed a 12-kb insertion in the R. prowazekii genome, relative to R. typhi and R. conorii, which appears to have occurred after the typhus (R. prowazekii and R. typhi) and spotted fever (R. conorii) groups diverged. The three-way comparison allowed further in silico analysis of the SpoT split genes, leading us to propose that the stringent response system is still functional in these rickettsiae.
BACKGROUND: The Drosophila melanogaster genome was the first metazoan genome to have been sequenced by the whole-genome shotgun (WGS) method. Two issues relating to this achievement were widely debated in the genomics community: how correct is the sequence with respect to base-pair (bp) accuracy and frequency of assembly errors? And, how difficult is it to bring a WGS sequence to the accepted standard for finished sequence? We are now in a position to answer these questions. RESULTS: Our finishing process was designed to close gaps, improve sequence quality and validate the assembly. Sequence traces derived from the WGS and draft sequencing of individual bacterial artificial chromosomes (BACs) were assembled into BAC-sized segments. These segments were brought to high quality, and then joined to constitute the sequence of each chromosome arm. Overall assembly was verified by comparison to a physical map of fingerprinted BAC clones. In the current version of the 116.9 Mb euchromatic genome, called Release 3, the six euchromatic chromosome arms are represented by 13 scaffolds with a total of 37 sequence gaps. We compared Release 3 to Release 2; in autosomal regions of unique sequence, the error rate of Release 2 was one in 20,000 bp. CONCLUSIONS: The WGS strategy can efficiently produce a high-quality sequence of a metazoan genome while generating the reagents required for sequence finishing. However, the initial method of repeat assembly was flawed. The sequence we report here, Release 3, is a reliable resource for molecular genetic experimentation and computational analysis.
The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see http:mgc.nci.nih.gov).
The fly Drosophila melanogaster is one of the most intensively studied organisms in biology and serves as a model system for the investigation of many developmental and cellular processes common to higher eukaryotes, including humans. We have determined the nucleotide sequence of nearly all of the approximately 120-megabase euchromatic portion of the Drosophila genome using a whole-genome shotgun sequencing strategy supported by extensive clone-based sequence and a high-quality bacterial artificial chromosome physical map. Efforts are under way to close the remaining gaps; however, the sequence is of sufficient accuracy and contiguity to be declared substantially complete and to support an initial analysis of genome structure and preliminary gene annotation and interpretation. The genome encodes approximately 13,600 genes, somewhat fewer than the smaller Caenorhabditis elegans genome, but with comparable functional diversity.
A total of 100 kb of DNA derived from 69 individual human brain cDNA clones of 0.7-2.0 kb were sequenced by concatenated cDNA sequencing (CCS), whereby multiple individual DNA fragments are sequenced simultaneously in a single shotgun library. The method yielded accurate sequences and a similar efficiency compared with other shotgun libraries constructed from single DNA fragments (> 20 kb). Computer analyses were carried out on 65 cDNA clone sequences and their corresponding end sequences to examine both nucleic acid and amino acid sequence similarities in the databases. Thirty-seven clones revealed no DNA database matches, 12 clones generated exact matches (> or = 98% identity), and 16 clones generated nonexact matches (57%-97% identity) to either known human or other species genes. Of those 28 matched clones, 8 had corresponding end sequences that failed to identify similarities. In a protein similarity search, 27 clone sequences displayed significant matches, whereas only 20 of the end sequences had matches to known protein sequences. Our data indicate that full-length cDNA insert sequences provide significantly more nucleic acid and protein sequence similarity matches than expressed sequence tags (ESTs) for database searching.
The efficiency of shotgun DNA sequencing depends to a great extent on the quality of the random-subclone libraries used. We here describe a novel "double adaptor" strategy for efficient construction of high-quality shotgun libraries. In this method, randomly sheared and end-repaired fragments are ligated to oligonucleotide adaptors creating 12-base overhangs. Nonphosphorylated oligonucleotides are used, which prevents formation of adaptor dimers and ensures efficient ligation of insert to adaptor. The vector is prepared from a modified M13 vector, by KpnI/PstI digestion followed by ligation to oligonucleotides with ends complementary to the overhangs created in the digest. These adaptors create 5'-overhangs complementary to those on the inserts. Following annealing of insert to vector, the DNA is directly used for transformation without a ligation step. This protocol is robust and shows three- to fivefold higher yield of clones compared to previous protocols. No chimeric clones can be detected and the background of clones without an insert is <1%. The procedure is rapid and shows potential for automation.