
NAR Top Articles - Database



August 2015

Pfam: the protein families database
Finn, RD; Bateman, A; Clements, J; Coggill, P; Eberhardt, RY; Eddy, SR; Heger, A; Hetherington, K; Holm, L; Mistry, J; Sonnhammer, ELL; Tate, J; Punta, M
Nucleic Acids Res. 2014, 42, D222-D230
Free Full Text
Pfam, available via servers in the UK and the USA, is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface...

The NHGRI GWAS Catalog, a curated resource of SNP-trait associations
Welter, D; MacArthur, J; Morales, J; Burdett, T; Hall, P; Junkins, H; Klemm, A; Flicek, P; Manolio, T; Hindorff, L; Parkinson, H
Nucleic Acids Res. 2014, 42, D1001-D1006
Free Full Text
The National Human Genome Research Institute (NHGRI) Catalog of Published Genome-Wide Association Studies (GWAS Catalog) provides a publicly available, manually curated collection of published GWAS assaying at least 100 000 single-nucleotide polymorphisms (SNPs) and all SNP-trait associations with P < 1 × 10⁻⁵. The Catalog includes 1751 curated publications of 11 912 SNPs. In addition to the SNP-trait association data, the Catalog also publishes a quarterly diagram of all SNP-trait associations mapped to the SNPs' chromosomal locations. The Catalog can be accessed via a tabular web interface, via a dynamic visualization on the human karyotype, as a downloadable tab-delimited file and as an OWL knowledge base. This article presents a number of recent improvements to the Catalog, including novel ways for users to interact with the Catalog and changes to the curation infrastructure.
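
The two inclusion thresholds stated in this abstract (studies assaying at least 100 000 SNPs, associations with P below 1 × 10⁻⁵) can be illustrated with a small filter. This is only a sketch under stated assumptions: the `Association` record, its field names and the sample rows are hypothetical and do not reflect the Catalog's actual file format or curation pipeline.

```python
from dataclasses import dataclass

# Thresholds taken from the abstract.
MIN_SNPS_ASSAYED = 100_000
P_VALUE_CUTOFF = 1e-5


@dataclass
class Association:
    """Hypothetical record; field names are assumptions for illustration."""
    snp_id: str
    trait: str
    p_value: float
    snps_assayed: int  # number of SNPs assayed by the source study


def meets_catalog_criteria(a: Association) -> bool:
    """True if the study is large enough and the association is below the P cutoff."""
    return a.snps_assayed >= MIN_SNPS_ASSAYED and a.p_value < P_VALUE_CUTOFF


# Made-up rows, not real Catalog data.
records = [
    Association("rs_example_1", "trait A", 3e-8, 500_000),  # passes both checks
    Association("rs_example_2", "trait A", 2e-4, 500_000),  # P value too large
    Association("rs_example_3", "trait B", 1e-9, 50_000),   # too few SNPs assayed
]

eligible = [a.snp_id for a in records if meets_catalog_criteria(a)]
print(eligible)
```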

JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles
Mathelier, A; Zhao, XB; Zhang, AW; Parcy, F; Worsley-Hunt, R; Arenillas, DJ; Buchman, S; Chen, CY; Chou, A; Ienasescu, H; Lim, J; Shyr, C; Tan, G; Zhou, M; Lenhard, B; Sandelin, A; Wasserman, WW
Nucleic Acids Res. 2014, 42, D142-D147
Free Full Text
JASPAR is the largest open-access database of matrix-based nucleotide profiles describing the binding preference of transcription factors from multiple species. The fifth major release greatly expands the heart of JASPAR, the JASPAR CORE subcollection (which contains curated, non-redundant profiles), with 135 new curated profiles (74 in vertebrates, 8 in Drosophila melanogaster, 10 in Caenorhabditis elegans and 43 in Arabidopsis thaliana; a 30% increase in total) and 43 older updated profiles (36 in vertebrates, 3 in D. melanogaster and 4 in A. thaliana; a 9% update in total). The new and updated profiles are mainly derived from published chromatin immunoprecipitation-seq experimental datasets. In addition, the web interface has been enhanced with advanced capabilities in browsing, searching and subsetting. Finally, the new JASPAR release is accompanied by a new BioPython package, a new R tool package and a new R/Bioconductor data package to facilitate access for both manual and automated methods.

starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data
Li, JH; Liu, S; Zhou, H; Qu, LH; Yang, JH
Nucleic Acids Res. 2014, 42, D92-D97
Free Full Text
Although microRNAs (miRNAs), other non-coding RNAs (ncRNAs) (e.g. lncRNAs, pseudogenes and circRNAs) and competing endogenous RNAs (ceRNAs) have been implicated in cell-fate determination and in various human diseases, surprisingly little is known about the regulatory interaction networks among the multiple classes of RNAs. In this study, we developed starBase v2.0 to systematically identify the RNA-RNA and protein-RNA interaction networks from 108 CLIP-Seq (PAR-CLIP, HITS-CLIP, iCLIP, CLASH) data sets generated by 37 independent studies. By analyzing millions of RNA-binding protein binding sites, we identified ~9000 miRNA-circRNA, 16 000 miRNA-pseudogene and 285 000 protein-RNA regulatory relationships. Moreover, starBase v2.0 has been updated to provide the most comprehensive CLIP-Seq experimentally supported miRNA-mRNA and miRNA-lncRNA interaction networks to date. We identified ~10 000 ceRNA pairs from CLIP-supported miRNA target sites...

The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data
Kohler, S; Doelken, SC; Mungall, CJ; Bauer, S; Firth, HV; Bailleul-Forestier, I; Black, GCM; Brown, DL; Brudno, M; Campbell, J; FitzPatrick, DR; Eppig, JT; Jackson, AP; Freson, K; Girdea, M; Helbig, I; Hurst, JA; Jahn, J; Jackson, LG; Kelly, AM; Ledbetter
Nucleic Acids Res. 2014, 42, D966-D974
Free Full Text
The Human Phenotype Ontology (HPO) project provides a structured, comprehensive and well-defined set of 10,088 classes (terms) describing human phenotypic abnormalities and 13,326 subclass relations between the HPO classes. In addition, we have developed logical definitions for 46% of all HPO classes using terms from ontologies for anatomy, cell types, function, embryology, pathology and other domains. This allows interoperability with several resources, especially those containing phenotype information on model organisms such as mouse and zebrafish. Here we describe the updated HPO database, which provides annotations of 7,278 human hereditary syndromes listed in OMIM, Orphanet and DECIPHER to classes of the HPO. Various meta-attributes such as frequency, references and negations are associated with each annotation. Several large-scale projects worldwide utilize the HPO for describing phenotype information in their datasets...

Data, information, knowledge and principle: back to metabolism in KEGG
Kanehisa, M; Goto, S; Sato, Y; Kawashima, M; Furumichi, M; Tanabe, M
Nucleic Acids Res. 2014, 42, D199-D205
Free Full Text
In the hierarchy of data, information and knowledge, computational methods play a major role in the initial processing of data to extract information, but on their own they are less effective at compiling knowledge from information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) resource has been developed as a reference knowledge base to assist this latter process. In particular, the KEGG pathway maps are widely used for biological interpretation of genome sequences and other high-throughput data. The link from genomes to pathways is made through the KEGG Orthology system, a collection of manually defined ortholog groups identified by K numbers. To better automate this interpretation process, the KEGG modules defined by Boolean expressions of K numbers have been expanded and improved. Once genes in a genome are annotated with K numbers, the KEGG modules can be computationally evaluated, revealing metabolic capacities and other phenotypic features. The reaction modules, which represent chemical units of reactions, have been used to analyze design principles of metabolic networks and also to improve the definition of K numbers...
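
The idea of evaluating a module, defined as a Boolean expression of K numbers, against the K numbers annotated in a genome can be sketched roughly as follows. This is a minimal illustration, not KEGG's implementation: the nested-tuple representation is an assumption made here and is not KEGG's own definition syntax, and the K numbers serve only as placeholders rather than as a real module definition.

```python
from typing import Set, Union

# A module expression is either a K number (str) or a tuple of the form
# ("and" | "or", sub-expression, ...). This encoding is an assumption
# made for illustration only.
Expr = Union[str, tuple]


def module_complete(expr: Expr, annotated: Set[str]) -> bool:
    """Return True if the Boolean expression is satisfied by the annotated K numbers."""
    if isinstance(expr, str):
        return expr in annotated
    op, *terms = expr
    if op == "and":
        return all(module_complete(t, annotated) for t in terms)
    if op == "or":
        return any(module_complete(t, annotated) for t in terms)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical three-step module: step 1 AND (step 2a OR step 2b) AND step 3.
EXAMPLE_MODULE = ("and", "K00001", ("or", "K00002", "K00003"), "K00004")

genome_a = {"K00001", "K00003", "K00004"}  # has one alternative for step 2
genome_b = {"K00001", "K00004"}            # lacks both alternatives for step 2

print(module_complete(EXAMPLE_MODULE, genome_a))
print(module_complete(EXAMPLE_MODULE, genome_b))
```

Under these assumptions, a genome satisfies the module only when every required step has at least one of its alternative K numbers annotated.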

miRBase: annotating high confidence microRNAs using deep sequencing data
Kozomara, A; Griffiths-Jones, S
Nucleic Acids Res. 2014, 42, D68-D73
Free Full Text
We describe an update of the miRBase database, the primary microRNA sequence repository. The latest miRBase release (v20, June 2013) contains 24 521 microRNA loci from 206 species, processed to produce 30 424 mature microRNA products. The rate of deposition of novel microRNAs and the number of researchers involved in their discovery continue to increase, driven largely by small RNA deep sequencing experiments. In the face of these increases, and a range of microRNA annotation methods and criteria, maintaining the quality of the microRNA sequence data set is a significant challenge. Here, we describe recent developments of the miRBase database to address this issue. In particular, we describe the collation and use of deep sequencing data sets to assign levels of confidence to miRBase entries. We now provide a high confidence subset of miRBase entries, based on the pattern of mapped reads. The high confidence microRNA data set is available alongside the complete microRNA collection. We also describe embedding microRNA-specific Wikipedia pages on the miRBase website to encourage the microRNA community to contribute and share textual and functional information.

The carbohydrate-active enzymes database (CAZy) in 2013
Lombard, V; Ramulu, HG; Drula, E; Coutinho, PM; Henrissat, B
Nucleic Acids Res. 2014, 42, D490-D495
Free Full Text
The Carbohydrate-Active Enzymes database (CAZy) provides online and continuously updated access to a sequence-based family classification linking the sequence to the specificity and 3D structure of the enzymes that assemble, modify and break down oligo- and polysaccharides. Functional and 3D structural information is added and curated on a regular basis based on the available literature. In addition to the use of the database by enzymologists seeking curated information on CAZymes, the dissemination of a stable nomenclature for these enzymes is probably a major contribution of CAZy. The past few years have seen the expansion of the CAZy classification scheme to new families, the development of subfamilies in several families and the power of CAZy for the analysis of genomes and metagenomes. This article outlines the changes that have occurred in CAZy during the past 5 years and presents our novel effort to display the resolution and the carbohydrate ligands in crystallographic complexes of CAZymes.

NCBI GEO: archive for functional genomics data sets-update
Barrett, T; Wilhite, SE; Ledoux, P; Evangelista, C; Kim, IF; Tomashevsky, M; Marshall, KA; Phillippy, KH; Sherman, PM; Holko, M; Yefanov, A; Lee, H; Zhang, NG; Robertson, CL; Serova, N; Davis, S; Soboleva, A
Nucleic Acids Res. 2013, 41, D991-D995
Free Full Text
The Gene Expression Omnibus (GEO) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.

Ribosomal Database Project: data and tools for high throughput rRNA analysis
Cole, JR; Wang, Q; Fish, JA; Chai, BL; McGarrell, DM; Sun, YN; Brown, CT; Porras-Alfaro, A; Kuske, CR; Tiedje, JM
Nucleic Acids Res. 2014, 42, D633-D642
Free Full Text
The Ribosomal Database Project (RDP) provides the research community with aligned and annotated rRNA gene sequence data, along with tools to allow researchers to analyze their own rRNA gene sequences in the RDP framework. RDP data and tools are utilized in fields as diverse as human health, microbial ecology, environmental microbiology, nucleic acid chemistry, taxonomy and phylogenetics. In addition to aligned and annotated collections of bacterial and archaeal small subunit rRNA genes, RDP now includes a collection of fungal large subunit rRNA genes. RDP tools, including Classifier and Aligner, have been updated to work with this new fungal collection. The use of high-throughput sequencing to characterize environmental microbial populations has exploded in the past several years, and as sequence technologies have improved, the sizes of environmental datasets have increased. With release 11, RDP is providing an expanded set of tools to facilitate analysis of high-throughput data, including both single-stranded and paired-end reads...
