NAR Top Articles - Database


February 2015

Pfam: the protein families database
Finn, RD; Bateman, A; Clements, J; Coggill, P; Eberhardt, RY; Eddy, SR; Heger, A; Hetherington, K; Holm, L; Mistry, J; Sonnhammer, ELL; Tate, J; Punta, M
Nucleic Acids Res. 2014, 42, D222-D230
Pfam, available via servers in the UK and the USA, is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface...

The NHGRI GWAS Catalog, a curated resource of SNP-trait associations
Welter, D; MacArthur, J; Morales, J; Burdett, T; Hall, P; Junkins, H; Klemm, A; Flicek, P; Manolio, T; Hindorff, L; Parkinson, H
Nucleic Acids Res. 2014, 42, D1001-D1006
The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS Catalog) provides a publicly available, manually curated collection of published GWAS assaying at least 100 000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10^-5. The Catalog includes 1751 curated publications of 11 912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs' chromosomal locations. The Catalog can be accessed via a tabular web interface, via a dynamic visualization on the human karyotype, as a downloadable tab-delimited file and as an OWL knowledge base. This article presents a number of recent improvements to the Catalog, including novel ways for users to interact with the Catalog and changes to the curation infrastructure.
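
As an illustration of working with the tab-delimited download, the following minimal sketch keeps only associations below the P < 1 × 10^-5 threshold stated in the abstract. The column names (`SNP`, `TRAIT`, `PVALUE`) and the sample rows are invented for this example and are not taken from the actual Catalog file.

```python
import csv
import io

# Inclusion threshold quoted in the abstract above.
P_THRESHOLD = 1e-5


def significant_associations(handle, threshold=P_THRESHOLD):
    """Return (SNP, trait) pairs whose p-value is below the threshold.

    Assumes a tab-delimited file with hypothetical columns
    SNP, TRAIT and PVALUE; the real file layout may differ.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["SNP"], row["TRAIT"])
        for row in reader
        if float(row["PVALUE"]) < threshold
    ]


# Invented sample data, for demonstration only.
SAMPLE = (
    "SNP\tTRAIT\tPVALUE\n"
    "snp_example_1\ttrait_a\t3e-8\n"
    "snp_example_2\ttrait_b\t2e-4\n"
    "snp_example_3\ttrait_a\t9e-6\n"
)

hits = significant_associations(io.StringIO(SAMPLE))
```

With the sample rows above, the second row is dropped because its p-value exceeds the threshold.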

JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles
Mathelier, A; Zhao, XB; Zhang, AW; Parcy, F; Worsley-Hunt, R; Arenillas, DJ; Buchman, S; Chen, CY; Chou, A; Ienasescu, H; Lim, J; Shyr, C; Tan, G; Zhou, M; Lenhard, B; Sandelin, A; Wasserman, WW
Nucleic Acids Res. 2014, 42, D142-D147
JASPAR is the largest open-access database of matrix-based nucleotide profiles describing the binding preference of transcription factors from multiple species. The fifth major release greatly expands the heart of JASPAR (the JASPAR CORE subcollection, which contains curated, non-redundant profiles) with 135 new curated profiles (74 in vertebrates, 8 in Drosophila melanogaster, 10 in Caenorhabditis elegans and 43 in Arabidopsis thaliana; a 30% increase in total) and 43 older updated profiles (36 in vertebrates, 3 in D. melanogaster and 4 in A. thaliana; a 9% update in total). The new and updated profiles are mainly derived from published chromatin immunoprecipitation-seq experimental datasets. In addition, the web interface has been enhanced with advanced capabilities in browsing, searching and subsetting. Finally, the new JASPAR release is accompanied by a new BioPython package, a new R tool package and a new R/Bioconductor data package to facilitate access for both manual and automated methods.
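
To make "matrix-based nucleotide profile" concrete, here is a minimal sketch of one common way such a profile is used: a count matrix is converted to log-odds scores and a candidate site is scored against it. The counts, the pseudocount of 1 and the uniform background of 0.25 are assumptions chosen for illustration; they are not JASPAR data, and this is not the API of the BioPython or R packages mentioned above.

```python
import math

BASES = "ACGT"


def to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    `counts` maps each base to a list of counts, one per position.
    The pseudocount and uniform background are illustrative assumptions.
    """
    n_positions = len(counts["A"])
    pwm = []
    for i in range(n_positions):
        total = sum(counts[b][i] for b in BASES) + 4 * pseudocount
        column = {
            b: math.log2(((counts[b][i] + pseudocount) / total) / background)
            for b in BASES
        }
        pwm.append(column)
    return pwm


def score(pwm, site):
    """Sum the log-odds of the observed base at each position."""
    if len(site) != len(pwm):
        raise ValueError("site length must match the profile length")
    return sum(column[base] for column, base in zip(pwm, site))


# Invented three-position count matrix, for demonstration only.
counts = {
    "A": [8, 0, 1],
    "C": [0, 9, 1],
    "G": [1, 0, 7],
    "T": [1, 1, 1],
}
pwm = to_log_odds(counts)
```

A site matching the most frequent base at every position (here `"ACG"`) scores above zero, while a site made of rare bases (here `"TTT"`) scores below zero.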

starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data
Li, JH; Liu, S; Zhou, H; Qu, LH; Yang, JH
Nucleic Acids Res. 2014, 42, D92-D97
Although microRNAs (miRNAs), other non-coding RNAs (ncRNAs) (e.g. lncRNAs, pseudogenes and circRNAs) and competing endogenous RNAs (ceRNAs) have been implicated in cell-fate determination and in various human diseases, surprisingly little is known about the regulatory interaction networks among the multiple classes of RNAs. In this study, we developed starBase v2.0 to systematically identify the RNA-RNA and protein-RNA interaction networks from 108 CLIP-Seq (PAR-CLIP, HITS-CLIP, iCLIP, CLASH) data sets generated by 37 independent studies. By analyzing millions of RNA-binding protein binding sites, we identified approximately 9000 miRNA-circRNA, 16 000 miRNA-pseudogene and 285 000 protein-RNA regulatory relationships. Moreover, starBase v2.0 has been updated to provide the most comprehensive CLIP-Seq experimentally supported miRNA-mRNA and miRNA-lncRNA interaction networks to date. We identified approximately 10 000 ceRNA pairs from CLIP-supported miRNA target sites. By combining 13 functional genomic annotations, we developed miRFunction and ceRNAFunction web servers to predict the function of miRNAs and other ncRNAs...

Expression Atlas update--a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments
Petryszak, R; Burdett, T; Fiorelli, B; Fonseca, NA; Gonzalez-Porta, M; Hastings, E; Huber, W; Jupp, S; Keays, M; Kryvych, N; McMurry, J; Marioni, JC; Malone, J; Megy, K; Rustici, G; Tang, AY; Taubert, J; Williams, E; Mannion, O; Parkinson, HE; Brazma, A
Nucleic Acids Res. 2014, 42, D926-D932
Expression Atlas is a value-added database providing information about gene, protein and splice variant expression in different cell types, organism parts, developmental stages, diseases and other biological and experimental conditions. The database consists of selected high-quality microarray and RNA-sequencing experiments from ArrayExpress that have been manually curated, annotated with Experimental Factor Ontology terms and processed using standardized microarray and RNA-sequencing analysis methods. The new version of Expression Atlas introduces the concept of 'baseline' expression, i.e. gene and splice variant abundance levels in healthy or untreated conditions, such as tissues or cell types. Differential gene expression data benefit from an in-depth curation of experimental intent, resulting in biologically meaningful 'contrasts'...

NONCODEv4: exploring the world of long non-coding RNA genes
Xie, CY; Yuan, J; Li, H; Li, M; Zhao, GG; Bu, DC; Zhu, WM; Wu, W; Chen, RS; Zhao, Y
Nucleic Acids Res. 2014, 42, D98-D103
NONCODE is an integrated knowledge database dedicated to non-coding RNAs (excluding tRNAs and rRNAs). Non-coding RNAs (ncRNAs) have been implicated in diseases and shown to play important roles in various biological processes. Since NONCODE version 3.0 was released 2 years ago, discovery of novel ncRNAs has been promoted by high-throughput RNA sequencing (RNA-Seq). In this update of NONCODE, we expand the ncRNA data set by collecting newly identified ncRNAs from literature published in the last 2 years and integrating the latest versions of RefSeq and Ensembl. In particular, the number of long non-coding RNAs (lncRNAs) has increased sharply from 73 327 to 210 831. Because lncRNAs show alternative splicing patterns similar to those of mRNAs, the concept of lncRNA genes was put forward to support a systematic understanding of lncRNAs. In total, 56 018 and 46 475 lncRNA genes were generated from 95 135 and 67 628 lncRNAs for human and mouse, respectively. Additionally, we present expression profiles of lncRNA genes as graphs based on public RNA-seq data for human and mouse, and predict functions of these lncRNA genes...

STRING v9.1: protein-protein interaction networks, with increased coverage and integration
Franceschini, A; Szklarczyk, D; Frankild, S; Kuhn, M; Simonovic, M; Roth, A; Lin, JY; Minguez, P; Bork, P; von Mering, C; Jensen, LJ
Nucleic Acids Res. 2013, 41, D808-D815
Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made, particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering > 1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements...

miRBase: annotating high confidence microRNAs using deep sequencing data
Kozomara, A; Griffiths-Jones, S
Nucleic Acids Res. 2014, 42, D68-D73
We describe an update of the miRBase database, the primary microRNA sequence repository. The latest miRBase release (v20, June 2013) contains 24 521 microRNA loci from 206 species, processed to produce 30 424 mature microRNA products. The rate of deposition of novel microRNAs and the number of researchers involved in their discovery continue to increase, driven largely by small RNA deep sequencing experiments. In the face of these increases, and a range of microRNA annotation methods and criteria, maintaining the quality of the microRNA sequence data set is a significant challenge. Here, we describe recent developments of the miRBase database to address this issue. In particular, we describe the collation and use of deep sequencing data sets to assign levels of confidence to miRBase entries. We now provide a high confidence subset of miRBase entries, based on the pattern of mapped reads. The high confidence microRNA data set is available alongside the complete microRNA collection. We also describe embedding microRNA-specific Wikipedia pages on the miRBase website to encourage the microRNA community to contribute and share textual and functional information.

Data, information, knowledge and principle: back to metabolism in KEGG
Kanehisa, M; Goto, S; Sato, Y; Kawashima, M; Furumichi, M; Tanabe, M
Nucleic Acids Res. 2014, 42, D199-D205
In the hierarchy of data, information and knowledge, computational methods play a major role in the initial processing of data to extract information, but on their own they become less effective at compiling knowledge from information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) resource has been developed as a reference knowledge base to assist this latter process. In particular, the KEGG pathway maps are widely used for biological interpretation of genome sequences and other high-throughput data. The link from genomes to pathways is made through the KEGG Orthology system, a collection of manually defined ortholog groups identified by K numbers. To better automate this interpretation process, the KEGG modules defined by Boolean expressions of K numbers have been expanded and improved. Once genes in a genome are annotated with K numbers, the KEGG modules can be computationally evaluated, revealing metabolic capacities and other phenotypic features. The reaction modules, which represent chemical units of reactions, have been used to analyze design principles of metabolic networks and also to improve the definition of K numbers...
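
The idea of evaluating a module defined as a Boolean expression of K numbers can be sketched as follows. The notation is a simplified one assumed for this example (comma for OR, whitespace for AND, parentheses for grouping, with AND binding more tightly than OR); it is not claimed to cover the full KEGG module syntax, and the K numbers used are placeholders with no biological meaning implied.

```python
import re


def module_complete(definition, annotated):
    """Evaluate a Boolean expression of K numbers against a genome.

    `definition` uses an assumed, simplified notation: ',' is OR,
    adjacency (whitespace) is AND, parentheses group sub-expressions.
    `annotated` is the set of K numbers assigned to the genome's genes.
    """
    tokens = re.findall(r"K\d{5}|[(),]", definition)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            rhs = parse_and()  # parse first so tokens are always consumed
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] not in (",", ")"):
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of definition")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            return value
        if token in (",", ")"):
            raise ValueError("unexpected token: " + token)
        return token in annotated

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens in definition")
    return result


# Placeholder definition: (K00001 or K00002) and K00003.
definition = "(K00001,K00002) K00003"
complete = module_complete(definition, {"K00002", "K00003"})
incomplete = module_complete(definition, {"K00001"})
```

Here the first genome satisfies the placeholder module because it carries one of the two alternatives plus the required third K number, while the second genome lacks the required one.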

The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data
Kohler, S; Doelken, SC; Mungall, CJ; Bauer, S; Firth, HV; Bailleul-Forestier, I; Black, GCM; Brown, DL; Brudno, M; Campbell, J; FitzPatrick, DR; Eppig, JT; Jackson, AP; Freson, K; Girdea, M; Helbig, I; Hurst, JA; Jahn, J; Jackson, LG; Kelly, AM; Ledbetter
Nucleic Acids Res. 2014, 42, D966-D974
The Human Phenotype Ontology (HPO) project provides a structured, comprehensive and well-defined set of 10,088 classes (terms) describing human phenotypic abnormalities and 13,326 subclass relations between the HPO classes. In addition, we have developed logical definitions for 46% of all HPO classes using terms from ontologies for anatomy, cell types, function, embryology, pathology and other domains. This allows interoperability with several resources, especially those containing phenotype information on model organisms such as mouse and zebrafish. Here we describe the updated HPO database, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO. Various meta-attributes such as frequency, references and negations are associated with each annotation. Several large-scale projects worldwide utilize the HPO for describing phenotype information in their datasets...
