NAR Top Articles - Database
Pfam: the protein families database
Finn, RD; Bateman, A; Clements, J; Coggill, P; Eberhardt, RY; Eddy, SR; Heger, A; Hetherington, K; Holm, L; Mistry, J; Sonnhammer, ELL; Tate, J; Punta, M
Nucleic Acids Res. 2014, 42, D222-D230
Free Full Text
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface...
The NHGRI GWAS Catalog, a curated resource of SNP-trait associations
Welter, D; MacArthur, J; Morales, J; Burdett, T; Hall, P; Junkins, H; Klemm, A; Flicek, P; Manolio, T; Hindorff, L; Parkinson, H
Nucleic Acids Res. 2014, 42, D1001-D1006
Free Full Text
The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS) Catalog provides a publicly available manually curated collection of published GWAS assaying at least 100 000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 x 10(-5). The Catalog includes 1751 curated publications of 11 912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs' chromosomal locations. The Catalog can be accessed via a tabular web interface, via a dynamic visualization on the human karyotype, as a downloadable tab-delimited file and as an OWL knowledge base. This article presents a number of recent improvements to the Catalog, including novel ways for users to interact with the Catalog and changes to the curation infrastructure.
starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data
Li, JH; Liu, S; Zhou, H; Qu, LH; Yang, JH
Nucleic Acids Res. 2014, 42, D92-D97
Free Full Text
Although microRNAs (miRNAs), other non-coding RNAs (ncRNAs) (e.g. lncRNAs, pseudogenes and circRNAs) and competing endogenous RNAs (ceRNAs) have been implicated in cell-fate determination and in various human diseases, surprisingly little is known about the regulatory interaction networks among the multiple classes of RNAs. In this study, we developed starBase v2.0 (http://starbase.sysu.edu.cn/) to systematically identify the RNA-RNA and protein-RNA interaction networks from 108 CLIP-Seq (PAR-CLIP, HITS-CLIP, iCLIP, CLASH) data sets generated by 37 independent studies. By analyzing millions of RNA-binding protein binding sites, we identified similar to 9000 miRNA-circRNA, 16 000 miRNA-pseudogene and 285 000 protein-RNA regulatory relationships. Moreover, starBase v2.0 has been updated to provide the most comprehensive CLIP-Seq experimentally supported miRNA-mRNA and miRNA-lncRNA interaction networks to date. We identified similar to 10 000 ceRNA pairs from CLIP-supported miRNA target sites. By combining 13 functional genomic annotations, we developed miRFunction and ceRNAFunction web servers to predict the function of miRNAs and other ncRNAs...
JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles
Mathelier, A; Zhao, XB; Zhang, AW; Parcy, F; Worsley-Hunt, R; Arenillas, DJ; Buchman, S; Chen, CY; Chou, A; Ienasescu, H; Lim, J; Shyr, C; Tan, G; Zhou, M; Lenhard, B; Sandelin, A; Wasserman, WW
Nucleic Acids Res. 2014, 42, D142-D147
Free Full Text
JASPAR (http://jaspar.genereg.net) is the largest open-access database of matrix-based nucleotide profiles describing the binding preference of transcription factors from multiple species. The fifth major release greatly expands the heart of JASPAR-the JASPAR CORE subcollection, which contains curated, non-redundant profiles-with 135 new curated profiles (74 in vertebrates, 8 in Drosophila melanogaster, 10 in Caenorhabditis elegans and 43 in Arabidopsis thaliana; a 30% increase in total) and 43 older updated profiles (36 in vertebrates, 3 in D. melanogaster and 4 in A. thaliana; a 9% update in total). The new and updated profiles are mainly derived from published chromatin immunoprecipitation-seq experimental datasets. In addition, the web interface has been enhanced with advanced capabilities in browsing, searching and subsetting. Finally, the new JASPAR release is accompanied by a new BioPython package, a new R tool package and a new R/Bioconductor data package to facilitate access for both manual and automated methods.
Data, information, knowledge and principle: back to metabolism in KEGG
Kanehisa, M; Goto, S; Sato, Y; Kawashima, M; Furumichi, M; Tanabe, M
Nucleic Acids Res. 2014, 42, D199-D205
Free Full Text
In the hierarchy of data, information and knowledge, computational methods play a major role in the initial processing of data to extract information, but they alone become less effective to compile knowledge from information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) resource (http://www.kegg.jp/ or http://www.genome.jp/kegg/) has been developed as a reference knowledge base to assist this latter process. In particular, the KEGG pathway maps are widely used for biological interpretation of genome sequences and other high-throughput data. The link from genomes to pathways is made through the KEGG Orthology system, a collection of manually defined ortholog groups identified by K numbers. To better automate this interpretation process the KEGG modules defined by Boolean expressions of K numbers have been expanded and improved. Once genes in a genome are annotated with K numbers, the KEGG modules can be computationally evaluated revealing metabolic capacities and other phenotypic features. The reaction modules, which represent chemical units of reactions, have been used to analyze design principles of metabolic networks and also to improve the definition of K numbers...
miRBase: annotating high confidence microRNAs using deep sequencing data
Kozomara, A; Griffiths-Jones, S
Nucleic Acids Res. 2014, 42, D68-D73
Free Full Text
We describe an update of the miRBase database (http://www.mirbase.org/), the primary microRNA sequence repository. The latest miRBase release (v20, June 2013) contains 24 521 microRNA loci from 206 species, processed to produce 30 424 mature microRNA products. The rate of deposition of novel microRNAs and the number of researchers involved in their discovery continue to increase, driven largely by small RNA deep sequencing experiments. In the face of these increases, and a range of microRNA annotation methods and criteria, maintaining the quality of the microRNA sequence data set is a significant challenge. Here, we describe recent developments of the miRBase database to address this issue. In particular, we describe the collation and use of deep sequencing data sets to assign levels of confidence to miRBase entries. We now provide a high confidence subset of miRBase entries, based on the pattern of mapped reads. The high confidence microRNA data set is available alongside the complete microRNA collection at http://www.mirbase.org/. We also describe embedding microRNA-specific Wikipedia pages on the miRBase website to encourage the microRNA community to contribute and share textual and functional information.
The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data
Kohler, S; Doelken, SC; Mungall, CJ; Bauer, S; Firth, HV; Bailleul-Forestier, I; Black, GCM; Brown, DL; Brudno, M; Campbell, J; FitzPatrick, DR; Eppig, JT; Jackson, AP; Freson, K; Girdea, M; Helbig, I; Hurst, JA; Jahn, J; Jackson, LG; Kelly, AM; Ledbetter
Nucleic Acids Res. 2014, 42, D966-D974
Free Full Text
The Human Phenotype Ontology (HPO) project, available at http://www.human-phenotype-ontology.org, provides a structured, comprehensive and well-defined set of 10,088 classes (terms) describing human phenotypic abnormalities and 13,326 subclass relations between the HPO classes. In addition we have developed logical definitions for 46% of all HPO classes using terms from ontologies for anatomy, cell types, function, embryology, pathology and other domains. This allows interoperability with several resources, especially those containing phenotype information on model organisms such as mouse and zebrafish. Here we describe the updated HPO database, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO. Various meta-attributes such as frequency, references and negations are associated with each annotation. Several large-scale projects worldwide utilize the HPO for describing phenotype information in their datasets...
RefSeq microbial genomes database: new representation and annotation strategy
Tatusova, T; Ciufo, S; Fedorov, B; O'Neill, K; Tolstoy, I
Nucleic Acids Res. 2014, 42, D553-D559
Free Full Text
The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system at http://www.ncbi.nlm.nih.gov/genome. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for the annotation, analysis and visualization bioinformatics tools. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.
The carbohydrate-active enzymes database (CAZy) in 2013
Lombard, V; Ramulu, HG; Drula, E; Coutinho, PM; Henrissat, B
Nucleic Acids Res. 2014, 42, D490-D495
Free Full Text
The Carbohydrate-Active Enzymes database (CAZy; http://www.cazy.org) provides online and continuously updated access to a sequence-based family classification linking the sequence to the specificity and 3D structure of the enzymes that assemble, modify and breakdown oligo-and polysaccharides. Functional and 3D structural information is added and curated on a regular basis based on the available literature. In addition to the use of the database by enzymologists seeking curated information on CAZymes, the dissemination of a stable nomenclature for these enzymes is probably a major contribution of CAZy. The past few years have seen the expansion of the CAZy classification scheme to new families, the development of subfamilies in several families and the power of CAZy for the analysis of genomes and metagenomes. This article outlines the changes that have occurred in CAZy during the past 5 years and presents our novel effort to display the resolution and the carbohydrate ligands in crystallographic complexes of CAZymes.
The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases
Caspi, R; Altman, T; Billington, R; Dreher, K; Foerster, H; Fulcher, CA; Holland, TA; Keseler, IM; Kothari, A; Kubo, A; Krummenacker, M; Latendresse, M; Mueller, LA; Ong, Q; Paley, S; Subhraveti, P; Weaver, DS; Weerasinghe, D; Zhang, PF; Karp, PD
Nucleic Acids Res. 2014, 42, D459-D471
Free Full Text
The MetaCyc database (MetaCyc.org) is a comprehensive and freely accessible database describing metabolic pathways and enzymes from all domains of life. MetaCyc pathways are experimentally determined, mostly small-molecule metabolic pathways and are curated from the primary scientific literature. MetaCyc contains >2100 pathways derived from >37 000 publications, and is the largest curated collection of metabolic pathways currently available. BioCyc (BioCyc.org) is a collection of >3000 organism-specific Pathway/Genome Databases (PGDBs), each containing the full genome and predicted metabolic network of one organism, including metabolites, enzymes, reactions, metabolic pathways, predicted operons, transport systems and pathway-hole fillers. Additions to BioCyc over the past 2 years include YeastCyc, a PGDB for Saccharomyces cerevisiae, and 891 new genomes from the Human Microbiome Project...
- About this journal
- NAR Methods online
- 2015 Database Issue
- 2014 Web Server Issue
- NAR Special Collections
- Referee Information
- Rights & Permissions
- Dispatch date of the next issue
- This journal is a member of the Committee on Publication Ethics (COPE)
- view Recent Comments on articles
- We are mobile – find out more
- Journals Career Network
Impact factor: 8.808
5-Yr impact factor: 8.378
Senior Executive Editors
- Instructions to authors
- Scope and Criteria for Consideration
- Submit a manuscript now
- Self-archiving policy
Open access options for authors