
NAR Top Articles - Database



July 2015

Pfam: the protein families database
Finn, RD; Bateman, A; Clements, J; Coggill, P; Eberhardt, RY; Eddy, SR; Heger, A; Hetherington, K; Holm, L; Mistry, J; Sonnhammer, ELL; Tate, J; Punta, M
Nucleic Acids Res. 2014, 42, D222-D230
Pfam, available via servers in the UK and the USA, is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface...

The ConsensusPathDB interaction database: 2013 update
Kamburov, A; Stelzl, U; Lehrach, H; Herwig, R
Nucleic Acids Res. 2013, 41, D793-D800
Knowledge of the various interactions between molecules in the cell is crucial for understanding cellular processes in health and disease. Currently available interaction databases, being largely complementary to each other, must be integrated to obtain a comprehensive global map of the different types of interactions. We have previously reported the development of an integrative interaction database called ConsensusPathDB that aims to fulfill this task. In this update article, we report its significant progress in terms of interaction content and web interface tools. ConsensusPathDB has grown mainly due to the integration of 12 further databases; it now contains 215 541 unique interactions and 4601 pathways from 30 databases overall. Binary protein interactions are scored with our confidence assessment tool, IntScore. The ConsensusPathDB web interface allows users to take advantage of these integrated interaction and pathway data in different contexts. Recent developments include pathway analysis of metabolite lists, visualization of functional gene/metabolite sets as overlap graphs, gene set analysis based on protein complexes and induced network modules analysis...

JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles
Mathelier, A; Zhao, XB; Zhang, AW; Parcy, F; Worsley-Hunt, R; Arenillas, DJ; Buchman, S; Chen, CY; Chou, A; Ienasescu, H; Lim, J; Shyr, C; Tan, G; Zhou, M; Lenhard, B; Sandelin, A; Wasserman, WW
Nucleic Acids Res. 2014, 42, D142-D147
JASPAR is the largest open-access database of matrix-based nucleotide profiles describing the binding preference of transcription factors from multiple species. The fifth major release greatly expands the heart of JASPAR, the JASPAR CORE subcollection of curated, non-redundant profiles, with 135 new curated profiles (74 in vertebrates, 8 in Drosophila melanogaster, 10 in Caenorhabditis elegans and 43 in Arabidopsis thaliana; a 30% increase in total) and 43 older updated profiles (36 in vertebrates, 3 in D. melanogaster and 4 in A. thaliana; a 9% update in total). The new and updated profiles are mainly derived from published chromatin immunoprecipitation-seq experimental datasets. In addition, the web interface has been enhanced with advanced capabilities in browsing, searching and subsetting. Finally, the new JASPAR release is accompanied by a new BioPython package, a new R tool package and a new R/Bioconductor data package to facilitate access for both manual and automated methods.

starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data
Li, JH; Liu, S; Zhou, H; Qu, LH; Yang, JH
Nucleic Acids Res. 2014, 42, D92-D97
Although microRNAs (miRNAs), other non-coding RNAs (ncRNAs) (e.g. lncRNAs, pseudogenes and circRNAs) and competing endogenous RNAs (ceRNAs) have been implicated in cell-fate determination and in various human diseases, surprisingly little is known about the regulatory interaction networks among the multiple classes of RNAs. In this study, we developed starBase v2.0 to systematically identify the RNA-RNA and protein-RNA interaction networks from 108 CLIP-Seq (PAR-CLIP, HITS-CLIP, iCLIP, CLASH) data sets generated by 37 independent studies. By analyzing millions of RNA-binding protein binding sites, we identified approximately 9000 miRNA-circRNA, 16 000 miRNA-pseudogene and 285 000 protein-RNA regulatory relationships. Moreover, starBase v2.0 has been updated to provide the most comprehensive CLIP-Seq experimentally supported miRNA-mRNA and miRNA-lncRNA interaction networks to date. We identified approximately 10 000 ceRNA pairs from CLIP-supported miRNA target sites. By combining 13 functional genomic annotations, we developed miRFunction and ceRNAFunction web servers to predict the function of miRNAs and other ncRNAs...

The NHGRI GWAS Catalog, a curated resource of SNP-trait associations
Welter, D; MacArthur, J; Morales, J; Burdett, T; Hall, P; Junkins, H; Klemm, A; Flicek, P; Manolio, T; Hindorff, L; Parkinson, H
Nucleic Acids Res. 2014, 42, D1001-D1006
The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS Catalog) provides a publicly available, manually curated collection of published GWAS assaying at least 100 000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10⁻⁵. The Catalog includes 1751 curated publications of 11 912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs' chromosomal locations. The Catalog can be accessed via a tabular web interface, via a dynamic visualization on the human karyotype, as a downloadable tab-delimited file and as an OWL knowledge base. This article presents a number of recent improvements to the Catalog, including novel ways for users to interact with the Catalog and changes to the curation infrastructure.
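The two inclusion thresholds stated in the abstract (studies assaying at least 100 000 SNPs, associations with P < 1 × 10⁻⁵) can be expressed as a simple filter. The sketch below is only an illustration under assumptions: the `Association` record, its field names, the `is_eligible` function and the sample rows are hypothetical and do not reflect the Catalog's actual schema or curation software.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract's stated inclusion criteria.
MIN_SNPS_ASSAYED = 100_000
P_VALUE_CUTOFF = 1e-5


@dataclass
class Association:
    """Hypothetical record; field names are illustrative only."""
    snp_id: str
    trait: str
    p_value: float
    snps_assayed: int  # number of SNPs assayed by the reporting study


def is_eligible(assoc: Association) -> bool:
    """Return True if the association meets both stated thresholds."""
    return (
        assoc.snps_assayed >= MIN_SNPS_ASSAYED
        and assoc.p_value < P_VALUE_CUTOFF
    )


# Made-up example rows, not real Catalog data.
rows = [
    Association("rs_example_1", "trait A", 3e-8, 500_000),
    Association("rs_example_2", "trait A", 2e-4, 500_000),  # P too large
    Association("rs_example_3", "trait B", 1e-9, 50_000),   # too few SNPs
]
eligible = [r.snp_id for r in rows if is_eligible(r)]
```

With these made-up rows, only the first record passes both checks.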

APPRIS: annotation of principal and alternative splice isoforms
Rodriguez, JM; Maietta, P; Ezkurdia, I; Pietrelli, A; Wesselink, JJ; Lopez, G; Valencia, A; Tress, ML
Nucleic Acids Res. 2013, 41, D110-D117
Here, we present APPRIS, a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.

miRBase: annotating high confidence microRNAs using deep sequencing data
Kozomara, A; Griffiths-Jones, S
Nucleic Acids Res. 2014, 42, D68-D73
We describe an update of the miRBase database, the primary microRNA sequence repository. The latest miRBase release (v20, June 2013) contains 24 521 microRNA loci from 206 species, processed to produce 30 424 mature microRNA products. The rate of deposition of novel microRNAs and the number of researchers involved in their discovery continue to increase, driven largely by small RNA deep sequencing experiments. In the face of these increases, and a range of microRNA annotation methods and criteria, maintaining the quality of the microRNA sequence data set is a significant challenge. Here, we describe recent developments of the miRBase database to address this issue. In particular, we describe the collation and use of deep sequencing data sets to assign levels of confidence to miRBase entries. We now provide a high confidence subset of miRBase entries, based on the pattern of mapped reads. The high confidence microRNA data set is available alongside the complete microRNA collection. We also describe embedding microRNA-specific Wikipedia pages on the miRBase website to encourage the microRNA community to contribute and share textual and functional information.

The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data
Kohler, S; Doelken, SC; Mungall, CJ; Bauer, S; Firth, HV; Bailleul-Forestier, I; Black, GCM; Brown, DL; Brudno, M; Campbell, J; FitzPatrick, DR; Eppig, JT; Jackson, AP; Freson, K; Girdea, M; Helbig, I; Hurst, JA; Jahn, J; Jackson, LG; Kelly, AM; Ledbetter
Nucleic Acids Res. 2014, 42, D966-D974
The Human Phenotype Ontology (HPO) project provides a structured, comprehensive and well-defined set of 10,088 classes (terms) describing human phenotypic abnormalities and 13,326 subclass relations between the HPO classes. In addition, we have developed logical definitions for 46% of all HPO classes using terms from ontologies for anatomy, cell types, function, embryology, pathology and other domains. This allows interoperability with several resources, especially those containing phenotype information on model organisms such as mouse and zebrafish. Here we describe the updated HPO database, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO. Various meta-attributes such as frequency, references and negations are associated with each annotation. Several large-scale projects worldwide utilize the HPO for describing phenotype information in their datasets...

Data, information, knowledge and principle: back to metabolism in KEGG
Kanehisa, M; Goto, S; Sato, Y; Kawashima, M; Furumichi, M; Tanabe, M
Nucleic Acids Res. 2014, 42, D199-D205
In the hierarchy of data, information and knowledge, computational methods play a major role in the initial processing of data to extract information, but on their own they are less effective at compiling knowledge from information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) resource has been developed as a reference knowledge base to assist this latter process. In particular, the KEGG pathway maps are widely used for biological interpretation of genome sequences and other high-throughput data. The link from genomes to pathways is made through the KEGG Orthology system, a collection of manually defined ortholog groups identified by K numbers. To better automate this interpretation process, the KEGG modules defined by Boolean expressions of K numbers have been expanded and improved. Once genes in a genome are annotated with K numbers, the KEGG modules can be computationally evaluated, revealing metabolic capacities and other phenotypic features. The reaction modules, which represent chemical units of reactions, have been used to analyze design principles of metabolic networks and also to improve the definition of K numbers...
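As a rough illustration of how a module defined by a Boolean expression of K numbers might be evaluated against an annotated genome, the sketch below checks whether a genome's set of K numbers satisfies a nested AND/OR expression. The nested-tuple representation, the `evaluate_module` function and the K numbers shown are all hypothetical placeholders; this is not KEGG's actual module definition syntax or software.

```python
from typing import Set, Tuple, Union

# A module expression is either a single K number (str) or a tuple of the
# form ("and" | "or", sub_expression, sub_expression, ...).
# This representation is an assumption made for illustration.
Expr = Union[str, Tuple]


def evaluate_module(expr: Expr, annotated: Set[str]) -> bool:
    """Return True if the genome's K numbers satisfy the expression."""
    if isinstance(expr, str):
        # Leaf: the ortholog group is present if its K number was annotated.
        return expr in annotated
    op, *operands = expr
    if op == "and":
        return all(evaluate_module(e, annotated) for e in operands)
    if op == "or":
        return any(evaluate_module(e, annotated) for e in operands)
    raise ValueError(f"unknown operator: {op!r}")


# Placeholder module: step 1 requires K90001; step 2 can be carried out
# by either K90002 or K90003. These identifiers are made up.
module = ("and", "K90001", ("or", "K90002", "K90003"))

genome_a = {"K90001", "K90003"}  # has step 1 and an alternative for step 2
genome_b = {"K90002", "K90003"}  # lacks step 1

complete_a = evaluate_module(module, genome_a)
complete_b = evaluate_module(module, genome_b)
```

Here `genome_a` satisfies the placeholder module and `genome_b` does not, which is the kind of presence/absence call the abstract describes as revealing metabolic capacities.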

NCBI GEO: archive for functional genomics data sets--update
Barrett, T; Wilhite, SE; Ledoux, P; Evangelista, C; Kim, IF; Tomashevsky, M; Marshall, KA; Phillippy, KH; Sherman, PM; Holko, M; Yefanov, A; Lee, H; Zhang, NG; Robertson, CL; Serova, N; Davis, S; Soboleva, A
Nucleic Acids Res. 2013, 41, D991-D995
The Gene Expression Omnibus (GEO) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.
