
NAR Top Articles - Database



April 2015

Pfam: the protein families database
Finn, RD; Bateman, A; Clements, J; Coggill, P; Eberhardt, RY; Eddy, SR; Heger, A; Hetherington, K; Holm, L; Mistry, J; Sonnhammer, ELL; Tate, J; Punta, M
Nucleic Acids Res. 2014, 42, D222-D230
Pfam, available via servers in the UK and the USA, is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface...

The NHGRI GWAS Catalog, a curated resource of SNP-trait associations
Welter, D; MacArthur, J; Morales, J; Burdett, T; Hall, P; Junkins, H; Klemm, A; Flicek, P; Manolio, T; Hindorff, L; Parkinson, H
Nucleic Acids Res. 2014, 42, D1001-D1006
The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS Catalog) provides a publicly available, manually curated collection of published GWAS assaying at least 100 000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10⁻⁵. The Catalog includes 1751 curated publications of 11 912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs' chromosomal locations. The Catalog can be accessed via a tabular web interface, via a dynamic visualization on the human karyotype, as a downloadable tab-delimited file and as an OWL knowledge base. This article presents a number of recent improvements to the Catalog, including novel ways for users to interact with the Catalog and changes to the curation infrastructure.
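
As a rough illustration of working with a tab-delimited download and the P-value inclusion threshold mentioned in the abstract, the sketch below filters rows by P-value. The column names (`SNP`, `TRAIT`, `P_VALUE`) and the sample rows are assumptions made up for this example and do not reflect the Catalog's actual file schema.

```python
import csv
import io

# Hypothetical tab-delimited excerpt; the column names and rows are invented
# for illustration and are not the Catalog's actual download schema.
SAMPLE_TSV = (
    "SNP\tTRAIT\tP_VALUE\n"
    "rs0000001\ttrait A\t3e-8\n"
    "rs0000002\ttrait B\t2e-4\n"
    "rs0000003\ttrait C\t9e-6\n"
)

P_THRESHOLD = 1e-5  # inclusion threshold stated in the abstract


def significant_associations(tsv_text, threshold=P_THRESHOLD):
    """Return the rows whose P-value is strictly below the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["P_VALUE"]) < threshold]


hits = significant_associations(SAMPLE_TSV)
for row in hits:
    print(row["SNP"], row["TRAIT"], row["P_VALUE"])
```

For a real download, the column names would need to be replaced with those found in the file's header line.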

JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles
Mathelier, A; Zhao, XB; Zhang, AW; Parcy, F; Worsley-Hunt, R; Arenillas, DJ; Buchman, S; Chen, CY; Chou, A; Ienasescu, H; Lim, J; Shyr, C; Tan, G; Zhou, M; Lenhard, B; Sandelin, A; Wasserman, WW
Nucleic Acids Res. 2014, 42, D142-D147
JASPAR is the largest open-access database of matrix-based nucleotide profiles describing the binding preference of transcription factors from multiple species. The fifth major release greatly expands the heart of JASPAR, the JASPAR CORE subcollection of curated, non-redundant profiles, with 135 new curated profiles (74 in vertebrates, 8 in Drosophila melanogaster, 10 in Caenorhabditis elegans and 43 in Arabidopsis thaliana; a 30% increase in total) and 43 older updated profiles (36 in vertebrates, 3 in D. melanogaster and 4 in A. thaliana; a 9% update in total). The new and updated profiles are mainly derived from published chromatin immunoprecipitation sequencing (ChIP-seq) experimental datasets. In addition, the web interface has been enhanced with advanced capabilities in browsing, searching and subsetting. Finally, the new JASPAR release is accompanied by a new BioPython package, a new R tool package and a new R/Bioconductor data package to facilitate access for both manual and automated methods.
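
To make "matrix-based nucleotide profile" concrete, here is a minimal sketch of how a position frequency matrix can be turned into log-odds scores and used to score a candidate site. This is a generic textbook conversion, not the JASPAR tooling itself. The matrix values, the uniform background and the pseudocount are all assumptions chosen for illustration.

```python
import math

# Hypothetical position frequency matrix (counts per base at 4 positions);
# the numbers are invented and do not come from any JASPAR profile.
PFM = {
    "A": [8, 0, 1, 2],
    "C": [1, 0, 7, 2],
    "G": [0, 10, 1, 3],
    "T": [1, 0, 1, 3],
}
BACKGROUND = 0.25   # assumed uniform base composition
PSEUDOCOUNT = 0.5   # assumed smoothing value, added to every cell


def pfm_to_pwm(pfm, background=BACKGROUND, pseudocount=PSEUDOCOUNT):
    """Convert counts to log2 odds scores against a uniform background."""
    length = len(next(iter(pfm.values())))
    pwm = {base: [] for base in pfm}
    for i in range(length):
        # Column total including the pseudocount added to each of the bases.
        total = sum(pfm[b][i] for b in pfm) + pseudocount * len(pfm)
        for b in pfm:
            prob = (pfm[b][i] + pseudocount) / total
            pwm[b].append(math.log2(prob / background))
    return pwm


def score(pwm, site):
    """Sum the per-position log-odds scores for a candidate site."""
    return sum(pwm[base][i] for i, base in enumerate(site))


pwm = pfm_to_pwm(PFM)
print(round(score(pwm, "AGCT"), 3), round(score(pwm, "TTTT"), 3))
```

A site matching the most frequent base in each column scores higher than one made of rarely observed bases. The pseudocount keeps zero counts from producing infinite negative scores.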

starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data
Li, JH; Liu, S; Zhou, H; Qu, LH; Yang, JH
Nucleic Acids Res. 2014, 42, D92-D97
Although microRNAs (miRNAs), other non-coding RNAs (ncRNAs) (e.g. lncRNAs, pseudogenes and circRNAs) and competing endogenous RNAs (ceRNAs) have been implicated in cell-fate determination and in various human diseases, surprisingly little is known about the regulatory interaction networks among the multiple classes of RNAs. In this study, we developed starBase v2.0 to systematically identify the RNA-RNA and protein-RNA interaction networks from 108 CLIP-Seq (PAR-CLIP, HITS-CLIP, iCLIP, CLASH) data sets generated by 37 independent studies. By analyzing millions of RNA-binding protein binding sites, we identified ~9000 miRNA-circRNA, 16 000 miRNA-pseudogene and 285 000 protein-RNA regulatory relationships. Moreover, starBase v2.0 has been updated to provide the most comprehensive CLIP-Seq experimentally supported miRNA-mRNA and miRNA-lncRNA interaction networks to date. We identified ~10 000 ceRNA pairs from CLIP-supported miRNA target sites. By combining 13 functional genomic annotations, we developed miRFunction and ceRNAFunction web servers to predict the function of miRNAs and other ncRNAs...

STRING v9.1: protein-protein interaction networks, with increased coverage and integration
Franceschini, A; Szklarczyk, D; Frankild, S; Kuhn, M; Simonovic, M; Roth, A; Lin, JY; Minguez, P; Bork, P; von Mering, C; Jensen, LJ
Nucleic Acids Res. 2013, 41, D808-D815
Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made, particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements...
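
The abstract does not say how the individual association scores are integrated. One common way to merge scores from independent evidence sources is a probabilistic ("noisy-OR") combination, sketched below purely as an assumption. The channel names and values are hypothetical, and any correction for prior probability is omitted.

```python
def combine_scores(channel_scores):
    """Noisy-OR combination: 1 minus the chance that every channel is wrong."""
    not_associated = 1.0
    for s in channel_scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        not_associated *= 1.0 - s
    return 1.0 - not_associated


# Hypothetical per-channel confidence scores for one protein pair.
evidence = {"experiments": 0.5, "text_mining": 0.5, "coexpression": 0.0}
combined = combine_scores(evidence.values())
print(combined)
```

Under this scheme, the combined score is never lower than the strongest single channel, and several moderate lines of evidence can add up to a high-confidence association.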

The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data
Kohler, S; Doelken, SC; Mungall, CJ; Bauer, S; Firth, HV; Bailleul-Forestier, I; Black, GCM; Brown, DL; Brudno, M; Campbell, J; FitzPatrick, DR; Eppig, JT; Jackson, AP; Freson, K; Girdea, M; Helbig, I; Hurst, JA; Jahn, J; Jackson, LG; Kelly, AM; Ledbetter
Nucleic Acids Res. 2014, 42, D966-D974
The Human Phenotype Ontology (HPO) project provides a structured, comprehensive and well-defined set of 10,088 classes (terms) describing human phenotypic abnormalities and 13,326 subclass relations between the HPO classes. In addition, we have developed logical definitions for 46% of all HPO classes using terms from ontologies for anatomy, cell types, function, embryology, pathology and other domains. This allows interoperability with several resources, especially those containing phenotype information on model organisms such as mouse and zebrafish. Here we describe the updated HPO database, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO. Various meta-attributes such as frequency, references and negations are associated with each annotation. Several large-scale projects worldwide utilize the HPO for describing phenotype information in their datasets...

TFBSshape: a motif database for DNA shape features of transcription factor binding sites
Yang, L; Zhou, TY; Dror, I; Mathelier, A; Wasserman, WW; Gordan, R; Rohs, R
Nucleic Acids Res. 2014, 42, D148-D155
Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein-DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database, for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE...

RefSeq microbial genomes database: new representation and annotation strategy
Tatusova, T; Ciufo, S; Fedorov, B; O'Neill, K; Tolstoy, I
Nucleic Acids Res. 2014, 42, D553-D559
The source of the microbial genomic sequences in the RefSeq collection is the set of primary sequence records submitted to the International Nucleotide Sequence Database public archives. These can be accessed through the Entrez search and retrieval system. Next-generation sequencing has enabled researchers to perform genomic sequencing at rates that were unimaginable in the past. Microbial genomes can now be sequenced in a matter of hours, which has led to a significant increase in the number of assembled genomes deposited in the public archives. This huge increase in DNA sequence data presents new challenges for bioinformatics tools for annotation, analysis and visualization. New strategies have been developed for the annotation and representation of reference genomes and sequence variations derived from population studies and clinical outbreaks.

Data, information, knowledge and principle: back to metabolism in KEGG
Kanehisa, M; Goto, S; Sato, Y; Kawashima, M; Furumichi, M; Tanabe, M
Nucleic Acids Res. 2014, 42, D199-D205
In the hierarchy of data, information and knowledge, computational methods play a major role in the initial processing of data to extract information, but on their own they become less effective at compiling knowledge from information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) resource has been developed as a reference knowledge base to assist this latter process. In particular, the KEGG pathway maps are widely used for biological interpretation of genome sequences and other high-throughput data. The link from genomes to pathways is made through the KEGG Orthology system, a collection of manually defined ortholog groups identified by K numbers. To better automate this interpretation process, the KEGG modules defined by Boolean expressions of K numbers have been expanded and improved. Once genes in a genome are annotated with K numbers, the KEGG modules can be computationally evaluated, revealing metabolic capacities and other phenotypic features. The reaction modules, which represent chemical units of reactions, have been used to analyze design principles of metabolic networks and also to improve the definition of K numbers...
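
The idea of evaluating a module defined as a Boolean expression of K numbers against an annotated genome can be sketched as follows. This is a simplified illustration: it uses a nested-tuple representation rather than KEGG's own definition syntax, and the K numbers and the module are arbitrary placeholders that do not reproduce any actual KEGG module.

```python
# A module is modelled as a nested Boolean expression over K numbers:
#   ("and", a, b, ...)  every sub-expression must hold
#   ("or", a, b, ...)   at least one sub-expression must hold
#   "K....."            the genome has a gene annotated with this K number
# Placeholder module: identifiers chosen arbitrarily for illustration.
MODULE = (
    "and",
    ("or", "K00001", "K00002"),
    "K00003",
    ("or", "K00004", ("and", "K00005", "K00006")),
)


def is_complete(expr, annotated):
    """Recursively evaluate a module expression against a set of K numbers."""
    if isinstance(expr, str):
        return expr in annotated
    op, *args = expr
    if op == "and":
        return all(is_complete(a, annotated) for a in args)
    if op == "or":
        return any(is_complete(a, annotated) for a in args)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical genomes, each reduced to the set of K numbers assigned to it.
genome_a = {"K00002", "K00003", "K00005", "K00006"}
genome_b = {"K00001", "K00003", "K00005"}

print(is_complete(MODULE, genome_a))
print(is_complete(MODULE, genome_b))
```

Here `genome_a` satisfies every step through alternative orthologs, while `genome_b` lacks both options for the last step, so the module would be reported as incomplete for it.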

miRBase: annotating high confidence microRNAs using deep sequencing data
Kozomara, A; Griffiths-Jones, S
Nucleic Acids Res. 2014, 42, D68-D73
We describe an update of the miRBase database, the primary microRNA sequence repository. The latest miRBase release (v20, June 2013) contains 24 521 microRNA loci from 206 species, processed to produce 30 424 mature microRNA products. The rate of deposition of novel microRNAs and the number of researchers involved in their discovery continue to increase, driven largely by small RNA deep sequencing experiments. In the face of these increases, and a range of microRNA annotation methods and criteria, maintaining the quality of the microRNA sequence data set is a significant challenge. Here, we describe recent developments of the miRBase database to address this issue. In particular, we describe the collation and use of deep sequencing data sets to assign levels of confidence to miRBase entries. We now provide a high confidence subset of miRBase entries, based on the pattern of mapped reads. The high confidence microRNA data set is available alongside the complete microRNA collection. We also describe embedding microRNA-specific Wikipedia pages on the miRBase website to encourage the microRNA community to contribute and share textual and functional information.
