Dreme Motif Analysis Essay

Abstract

ChIP-seq is increasingly used to characterize transcription factor binding and chromatin marks at a genomic scale. Various tools are now available to extract binding motifs from peak data sets. However, most approaches are only available as command-line programs, or via a website but with size restrictions. We present peak-motifs, a computational pipeline that discovers motifs in peak sequences, compares them with databases, exports putative binding sites for visualization in the UCSC genome browser and generates an extensive report suited for both naive and expert users. It relies on time- and memory-efficient algorithms enabling the treatment of several thousand peaks within minutes. Regarding time efficiency, peak-motifs outperforms all comparable tools by several orders of magnitude. We demonstrate its accuracy by analyzing data sets ranging from 4000 to 128 000 peaks for 12 embryonic stem cell-specific transcription factors. In all cases, the program finds the expected motifs and returns additional motifs potentially bound by cofactors. We further apply peak-motifs to discover tissue-specific motifs in peak collections for the p300 transcriptional co-activator. To our knowledge, peak-motifs is the only tool that performs a complete motif analysis and offers a user-friendly web interface without any restriction on sequence size or number of peaks.

INTRODUCTION

ChIP-seq (1, 2) has recently become a method of choice to study the binding preferences of transcription factors, as well as the localization of epigenetic regulatory marks at a genomic scale. The first steps of the computational analysis (read mapping and peak calling) typically result in several thousand peak regions ranging between 200 and 10 000 bp. Motif analysis is required to extract the relevant information from these regions: discovering binding motifs that capture the binding specificity of the pulled-down factor and its possible co-regulators; comparing discovered motifs to databases to predict associated transcription factors; predicting the exact positions of the binding sites (usually much shorter than the peak regions); and studying the binding specificity of transcription factors in various contexts (cell types, mutant strains and transcription factor isoforms).

Specialized software tools have recently been developed for the analysis of ChIP-seq peaks, supporting different combinations of motif-related tasks (Table 1). An important bottleneck for most existing tools is that the underlying algorithms were originally developed to discover binding motifs from a small set of co-regulated promoters, and can hardly treat the thousands of peaks produced by ChIP-seq experiments. This limitation is typically circumvented by restricting motif discovery to a few hundred peak regions and by truncating the peaks to a maximal width (e.g. 100 bp) to further reduce the total size of the sequence set (3–5). However, given the power of the genome-wide experimental approach, one would like to be able to analyze the full data set. Some alternative algorithms support the analysis of large-scale data sets but are only available via a Unix shell interface (6–8), or as MATLAB functions (9), and are thus of poor usability for life-science researchers.
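The size-reduction workaround described above can be sketched in a few lines of Python. This is only an illustration: the function names, the tuple layout `(chrom, start, end, score)`, the choice to keep the top-scoring peaks and the choice to clip around the peak midpoint are all assumptions, not the procedure of any particular tool.

```python
def clip_peak(start, end, max_width=100):
    """Return a window of at most max_width bp centred on the peak midpoint."""
    if end - start <= max_width:
        return start, end
    mid = (start + end) // 2
    new_start = mid - max_width // 2
    return new_start, new_start + max_width


def reduce_peaks(peaks, max_peaks=600, max_width=100):
    """Keep the max_peaks best-scoring peaks and clip each one.

    peaks: list of (chrom, start, end, score) tuples (hypothetical layout).
    """
    top = sorted(peaks, key=lambda p: p[3], reverse=True)[:max_peaks]
    return [(c,) + clip_peak(s, e, max_width) + (sc,) for c, s, e, sc in top]


if __name__ == "__main__":
    demo = [("chr1", 1000, 2000, 5.0), ("chr1", 3000, 3050, 9.0)]
    for row in reduce_peaks(demo, max_peaks=2, max_width=100):
        print(row)
```

The sketch makes the cost of the workaround visible: every peak outside the top `max_peaks`, and every base outside the clipped window, is simply discarded before motif discovery starts.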

Table 1.

Features of software tools used for analyzing motifs in ChIP-seq peak sequences

Program Peak-motifs ChipMunk CompleteMotifs MEME-ChIP MICSA GimmeMotifs 
Web interface Yes Yes Yes Yes No No 
Size limitation Unrestricted (website tested with 22 Mb) 100 kb (website) 500 kb (web site) Unrestricted, but analysis limited to 600 peaks clipped to 100 bp Motif discovery restricted to a few hundred base pairs – 
Stand-alone version Yes Yes No Yes Yes Yes 
Tasks 
    Peak finding No No No No Yes No 
    Annotation of peak-flanking genes No No Yes No No 
    Sequence composition (mono- and di-nucleotides) Yes No No No No 
    Motif discovery Yes Yes Yes Yes Yes Yes 
    Enrichment in motifs from databases No No Yes Yes No 
    Enrichment in discovered motifs Yes No No No No 
    Peak scoring No No No Yes Yes No 
    Motif clustering No No No No Yes 
    Comparison discovered motifs/motif DB Yes No No Yes Yes 
    Sequence scanning for site prediction Yes No No Yes No 
    Positional distribution of sites inside peaks Yes No Yes No Yes 
    Visualization in genome browsers Yes No Yes No No 
Motif discovery algorithms RSAT oligo-analysis RSAT dyad-analysis RSAT local-word-analysis MEME ChIPMunk ChipMunk ChipMunk MEME Weeder MEME DREME MEME MEME Weeder MotifSampler BioProspector Gadem Improbizer MDmodule Trawler MoAn 
Pattern matching algorithms RSAT matrix-scan-quick No patser MAST + AME (enrichment) No 
Motif comparison algorithm RSAT compare-motifs No STAMP TOMTOM STAMP 
Motif clustering algorithm STAMP 
Comparison between discovered motifs Yes No Yes No Yes 
Motif database comparisons JASPAR UNIPROBE DMMPMM RegulonDB upload your own database No JASPAR TRANSFAC JASPAR TRANSFAC UNIPROBE FLYREG DPINTERACT SCPD DMMPMM and many others No 
Motif sizes Variable (multiple word assembly) User-specified ≤25 for MEME ≤12 for Weeder ≤ 13 for ChipMunk Predefined ranges (small, medium, large, extra-large) 
Multiple motifs Yes Yes Yes Yes 
Ref (PMID) This article 20736340 21183585 21486936 20375099 21081511 

We have developed a computational pipeline called ‘peak-motifs’, motivated by the pressing need for a statistically reliable, time-efficient and user-friendly framework to analyze full data sets of ChIP-seq peaks or similar data (ChIP-PET, ChIP-on-chip, CLIP-seq). This comprehensive pipeline takes as input a set of peak sequences, discovers exceptional motifs, compares them with motif databases, predicts binding site positions and returns a structured HTML report with direct links to visualization in the UCSC genome browser (Figure 1). This tool can also be used for differential analyses, where two data sets are given as input (e.g. test versus control, or peaks from two experimental conditions), to discover motifs specific to one of the data sets.
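As an illustration of the last step, predicted sites can be written as a BED annotation track, a format the UCSC genome browser accepts for custom tracks. The sketch below is not the pipeline's own exporter; the function names, the tuple layout and the track name are hypothetical.

```python
def to_genomic(peak_start, offset, site_len):
    """Convert a 0-based site offset inside a peak to genomic start/end."""
    start = peak_start + offset
    return start, start + site_len


def sites_to_bed(sites, track_name="predicted_sites"):
    """Format sites as a BED6 custom track.

    sites: iterable of (chrom, start, end, motif_name, score, strand),
    with 0-based half-open coordinates as BED expects.
    """
    lines = [f'track name="{track_name}" description="Predicted binding sites"']
    for chrom, start, end, motif, score, strand in sites:
        # BED scores are integers between 0 and 1000.
        bed_score = max(0, min(1000, round(score)))
        lines.append("\t".join(map(str, (chrom, start, end, motif, bed_score, strand))))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    start, end = to_genomic(peak_start=1000, offset=25, site_len=8)
    print(sites_to_bed([("chr1", start, end, "motif_1", 7.6, "+")]), end="")
```

Sites are usually reported relative to the peak sequence, so the conversion to genomic coordinates has to happen before export.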

Figure 1.

Schematic flow chart of the peak-motifs pipeline. For the sake of clarity, only the main analysis steps are depicted. The pipeline takes as input a set of peak sequences, and runs several de novo motif discovery algorithms based on different detection criteria: over-representation, differential representation (test versus control), global position bias or local over-representation along the centered peaks. Transcription factors are predicted by matching discovered motifs against several public motif databases and/or against user-uploaded motif collections. Peak sequences are scanned with the discovered motifs to predict precise binding positions. These positions are then automatically exported as an annotation track for the UCSC genome browser, thus enabling flexible visualization in their genomic context.

We first show that this motif discovery approach is significantly faster than other available alternatives, thereby allowing processing of comprehensive ChIP-seq data sets, even from the web server. We then demonstrate the biological relevance of the motifs discovered by our pipeline with two study cases, highlighting the benefit of analyzing complete datasets and using complementary approaches for motif discovery.

MATERIALS AND METHODS

The motif discovery step relies on a combination of tried-and-tested algorithms integrated in the software suite Regulatory Sequence Analysis Tools (RSAT, http://rsat.ulb.ac.be/rsat/) (10–12), which use complementary criteria to detect exceptional words (oligonucleotides and spaced motifs): global over-representation of oligonucleotides (oligo-analysis) or spaced pairs (dyad-analysis), heterogeneous positional distribution (position-analysis) and local over-representation (local-word-analysis) (12–15).
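The idea of global over-representation can be illustrated with a minimal word-counting sketch. It is not the RSAT implementation: it counts words on a single strand, assumes a uniform background (each k-mer has prior probability 1/4^k) and uses a plain binomial tail as the significance score. All function names are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log


def count_kmers(seqs, k):
    """Count overlapping k-mers (single strand), skipping words with non-ACGT letters."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts


def binomial_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p), with 0 < p < 1, computed in log space."""
    total = 0.0
    for i in range(x, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return min(1.0, total)


def over_represented(seqs, k, alpha=1e-3):
    """Return (word, count, p-value) for words more frequent than expected."""
    counts = count_kmers(seqs, k)
    n = sum(counts.values())
    p = 1.0 / 4 ** k  # uniform background: an assumption, not a calibrated model
    results = [(w, c, binomial_tail(n, c, p)) for w, c in counts.items()]
    return sorted((r for r in results if r[2] <= alpha), key=lambda r: r[2])


if __name__ == "__main__":
    peaks = ["TTCACGTGAA", "GGCACGTGTT", "ACACGTGCCA", "TCACGTGGGA"]
    for word, count, pval in over_represented(peaks, k=6, alpha=1e-2):
        print(word, count, f"{pval:.2e}")
```

A realistic analysis would replace the uniform prior with word frequencies estimated from a background sequence set and would correct for the number of words tested.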

The motif comparison step is performed by compare-matrices (12), which supports a wide range of scoring metrics and displays the results as multiple alignments of logos, making it easy to see the similarities between a discovered motif and several known motifs. This feature is particularly valuable for revealing adjacent fragments of a discovered motif that resemble two distinct known motifs, suggesting a bipartite motif for two factors (see the SOCT motif in Figure 4 and below).

As the individual components of the workflow have been described previously (12), we briefly explain here the choice of parameters for the different steps of peak-motifs analyses. The full list of commands and parameters is automatically reported at the end of each peak-motifs report. The parameters used for the case studies are available in the peak-motifs reports on the supporting website (http://rsat.bigre.ulb.ac.be/~rsat/supp_material_peak-motifs/).

Motif discovery

Word-based analysis is performed with hexanucleotides ( k = 6 ) and heptanucleotides ( k = 7

Abstract

The MEME Suite is a powerful, integrated set of web-based tools for studying sequence motifs in proteins, DNA and RNA. Such motifs encode many biological functions, and their detection and characterization is important in the study of molecular interactions in the cell, including the regulation of gene expression. Since the previous description of the MEME Suite in the 2009 Nucleic Acids Research Web Server Issue, we have added six new tools. Here we describe the capabilities of all the tools within the suite, give advice on their best use and provide several case studies to illustrate how to combine the results of various MEME Suite tools for successful motif-based analyses. The MEME Suite is freely available for academic use at http://meme-suite.org, and source code is also available for download and local installation.

INTRODUCTION

A DNA, RNA or protein sequence motif is a short pattern that is conserved by evolution. In DNA, a motif may correspond to a protein-binding site; in proteins, a motif may correspond to the active site of an enzyme or a structural unit necessary for proper folding of the protein. Thus, sequence motifs are one of the basic functional units of molecular evolution. Consequently, identifying and understanding these motifs is fundamental to building models of cellular processes at the molecular scale and to understanding the mechanisms of human disease.

The MEME Suite is a software toolkit for performing motif-based sequence analysis, which is valuable in a wide variety of scientific contexts. The MEME Suite software has played an important role in the study of biological processes involving DNA, RNA and proteins in over 9800 published studies. With the advent of high-throughput genomics and proteomics, the importance of motif analysis continues to increase. The MEME Suite has been used to make a wide variety of biological discoveries, examples of which are listed in Supplementary Table S1.

The web-based version of the MEME Suite comprises an integrated set of tools and databases for performing motif-based sequence analyses (Figure 1). The core of the suite is the meme motif discovery algorithm, which finds motifs in unaligned collections of DNA, RNA and protein sequences (1). Initially described in 1994, meme has been continually maintained and improved in the ensuing 20 years. The meme web server came online in 1996 and is now widely used, with almost 20 000 unique users in 2014 alone.

Figure 1.

Overview of the integrated tools in the MEME Suite. Tools added since the MEME Suite web server was last described (15) are underlined.

Using the MEME Suite

The web-based version of the MEME Suite includes 13 tools (1,2,3,4,5,6,7,8,9,10,11,12,13) for performing motif discovery, motif enrichment analysis, motif scanning and motif–motif comparisons (Figure 1). Six of these tools—DREME (3), MEME-ChIP (4), CentriMo (6), AME (7), SpaMo (8) and MCAST (12)—were developed or given web interfaces since the last publication describing the MEME Suite (15). For motif discovery and motif enrichment analyses, the user provides a set of unaligned DNA, RNA or protein sequences (Figure 1, upper left). Typically, these sequences might be ChIP-seq peak regions, cross-linking sites from a CLIP-seq experiment, promoters of co-expressed genes or proteins sharing a common function such as being modified by the same kinase.

Motif discovery finds de novo motifs in the user-provided sequences (Figure 1, upper middle). These motifs can then be input directly to the motif scanning and motif comparison tools of the MEME Suite (Figure 1, right) to identify other proteins or genomic sequences that may contain the discovered motifs, or to determine if the motifs are similar to previously studied motifs. The MEME Suite provides a large number of proteomic and genomic sequence databases (Figure 1, top right) for motif scanning and many motif databases for motif comparison (Figure 1, bottom right).

The four different motif discovery algorithms suit different purposes. meme is a general purpose motif discovery algorithm for both nucleotide and peptide motifs, but is less sensitive than DREME for finding short nucleotide motifs. Neither meme nor DREME allows insertions or deletions in the motifs they find, but glam2 does. Finally, meme-chip is adapted to very large datasets that cannot be handled by meme, and it actually performs motif discovery, motif enrichment and motif comparison on its input sequences, producing a fully integrated report. A comprehensive protocol for using meme-chip has recently been published (16).

Motif enrichment analysis tests known motifs for enrichment in a set of user-provided sequences. This approach is more sensitive than motif discovery, but motif enrichment analysis is limited to detecting enrichment of motifs contained in the database of motifs selected as input (Figure 1, middle left). Sensitivity is highest with CentriMo, which leverages the extra information sometimes contained in the position of the motif within each of the input sequences. The sequences input to CentriMo must all have the same length, which is not the case with the less sensitive motif enrichment algorithm AME. The SpaMo algorithm looks for preferred spacings in the input sequences between two motifs, rather than enrichment of a single motif. Finally, the gomo algorithm performs motif scanning of promoter sequences followed by a Gene Ontology enrichment analysis, so it is often applied to de novo discovered motifs to identify their possible biological functions.
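To make the idea of positional information concrete, the sketch below tests whether the best motif match in each sequence falls in a central window more often than a uniform placement would predict. It is a simplified illustration under stated assumptions, not CentriMo's actual algorithm: one best site per sequence, equal-length sequences, a fixed window and a plain binomial tail. The function names are hypothetical.

```python
from math import exp, lgamma, log


def binomial_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p); handles the degenerate p = 0 or 1 cases."""
    if p >= 1.0 or x <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    total = 0.0
    for i in range(x, n + 1):
        total += exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                     + i * log(p) + (n - i) * log(1 - p))
    return min(1.0, total)


def central_enrichment(best_starts, seq_len, motif_len, window):
    """Test for enrichment of best-site start positions near the sequence centre.

    best_starts: 0-based start of the best match in each sequence.
    Returns (hits_in_window, null_probability, p_value).
    """
    n_pos = seq_len - motif_len + 1          # possible start positions
    centre = (n_pos - 1) / 2
    half = window / 2
    central = sum(1 for s in range(n_pos) if abs(s - centre) <= half)
    p_null = central / n_pos                 # uniform placement under the null
    hits = sum(1 for s in best_starts if abs(s - centre) <= half)
    return hits, p_null, binomial_tail(len(best_starts), hits, p_null)


if __name__ == "__main__":
    starts = [47, 45, 50, 48, 44, 10, 90, 46]
    print(central_enrichment(starts, seq_len=100, motif_len=6, window=10))
```

The requirement that all sequences have the same length shows up directly in the sketch: the null probability is only well defined when every sequence offers the same set of positions.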

Motif scanning involves identifying locations of occurrences of a given set of motifs within a given set of sequences. As with motif discovery, the four motif scanning tools suit distinct purposes (Figure 1, middle right). The fimo algorithm identifies all individual motif occurrences and is the method of choice for scanning genomes. Its output can be uploaded to the UCSC genome browser for viewing. In contrast, the mast algorithm is sequence oriented and assigns each sequence in the selected database a score based on how well it matches all of the motifs input by the user. Thus, mast is most suited to scanning short sequences such as proteins or promoters. The mcast algorithm scans genomes for clusters containing multiple matches to any or all of the motifs in its input. It was designed for detecting cis-regulatory modules (CRMs) bound by a known set of transcription factors. Finally, the glam2scan algorithm is similar to fimo but is designed to accept glam2 motifs; hence, the resulting motif matches may contain insertions and deletions.
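The basic operation behind reporting individual motif occurrences can be sketched as a log-odds scan of every window on both strands. This is a generic position weight matrix scan written for illustration, not the fimo implementation: it thresholds raw scores rather than p-values, assumes a uniform background, and uses an arbitrary pseudocount. All names are hypothetical.

```python
from math import log2

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def log_odds(prob_matrix, pseudo=0.01):
    """Convert per-position probabilities (dicts keyed by A, C, G, T) to log-odds.

    Assumes a uniform background of 0.25 per base.
    """
    return [{b: log2((col[b] + pseudo) / (1 + 4 * pseudo) / 0.25) for b in "ACGT"}
            for col in prob_matrix]


def scan(seq, pwm, threshold):
    """Return (start, end, strand, score) for every window scoring >= threshold."""
    width = len(pwm)
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - width + 1):
        word = seq[i:i + width]
        if any(b not in "ACGT" for b in word):
            continue
        for strand, s in (("+", word), ("-", revcomp(word))):
            score = sum(col[b] for col, b in zip(pwm, s))
            if score >= threshold:
                hits.append((i, i + width, strand, round(score, 3)))
    return hits


if __name__ == "__main__":
    def column(base):
        return {b: (0.97 if b == base else 0.01) for b in "ACGT"}

    gata = log_odds([column(b) for b in "GATA"])
    print(scan("TTGATATTTATCTT", gata, threshold=5.0))
```

The coordinates are 0-based and half-open, which matches the BED convention used for genome browser tracks.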

The MEME Suite's motif comparison tool, Tomtom, allows the user to compare motifs discovered by the suite, by other tools, or taken from the literature to all of the motifs in a selected database of motifs (Figure 1, bottom right). For example, Tomtom can be used to determine if a reported consensus sequence for a transcription factor motif matches any known motifs in databases of motifs produced using SELEX or protein-binding microarrays. Tomtom aligns each input motif with each motif in the selected database and reports the most similar pairs, along with estimates of the statistical significance of each match.
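The alignment step can be sketched as follows: slide the query motif along the target, score each ungapped offset by summing a column-to-column similarity, and keep the best offset. The sketch uses Pearson correlation between columns as the similarity, which is an assumption chosen for simplicity; it ignores reverse complements and does not estimate statistical significance, so it should not be read as Tomtom's method. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat column carries no information
    return cov / sqrt(vx * vy)


def best_alignment(query, target, min_overlap=3):
    """Return (score, offset, overlap) of the best ungapped alignment.

    query, target: lists of columns, each column being [pA, pC, pG, pT].
    offset is the target index aligned with query column 0.
    """
    best = None
    for offset in range(-(len(query) - min_overlap), len(target) - min_overlap + 1):
        score, overlap = 0.0, 0
        for i, qcol in enumerate(query):
            j = i + offset
            if 0 <= j < len(target):
                score += pearson(qcol, target[j])
                overlap += 1
        if overlap >= min_overlap and (best is None or score > best[0]):
            best = (score, offset, overlap)
    return best


if __name__ == "__main__":
    def column(base):
        return [0.97 if b == base else 0.01 for b in "ACGT"]

    target = [column(b) for b in "ACGTAC"]
    query = [column(b) for b in "GTA"]
    print(best_alignment(query, target))
```

Repeating this for every motif in a database and ranking the results gives a list of the most similar pairs; turning those scores into significance estimates requires a null model that this sketch does not attempt.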

Users can also use the MEME Suite with motifs generated by other motif analysis tools or taken directly from the literature (Figure 1, bottom left). As described in the next section, the web server allows user-specified motifs to be input in many convenient formats. Although omitted from Figure 1 for clarity, the motif scanning tools also allow for user-provided sequences, and the motif comparison tool allows for uploading user-provided motif databases.

The MEME Suite web interface

The MEME Suite provides a set of consistent input forms for its 13 web-based tools (e.g. Figure 2). For each input field, ‘help bubbles’ provide an explanation of what information is required, how you can provide it and, in many cases, examples of valid input. You can view a help bubble by clicking on the question mark ‘?’ situated to the right of the input field. Within each of the groupings of MEME Suite motif analysis tools (discovery, enrichment, scanning, comparison), the user interfaces are consistent and flexible. For example, you can input sequences required by the first three groupings either by selecting a file for upload or by typing (or cut-and-paste). As a second example of consistency and versatility, all the tools that accept motifs as input (for enrichment, scanning or comparison) allow you to upload them by selecting a file name or by typing or cutting-and-pasting one or more motifs in any of a number of different formats.

When you enter motifs by typing, the web interface automatically detects whether you are specifying a motif as one or more sequence sites (e.g. a consensus sequence or a multiple alignment) or as a count or probability matrix and interactively displays a logo for the motif (Figure 2). Typed sequence sites allow the entire IUPAC alphabet (including ambiguous characters) for DNA and proteins. If you enter numbers instead of letters, then the web interface assumes you are entering either a count matrix or a probability matrix, and automatically determines whether rows correspond to positions in the motif or to letters in the alphabet. As you enter motifs by typing, the web interface reports errors such as unsupported characters or if the sequence sites are of inconsistent lengths. Multiple motifs may be specified simply by separating them by a blank line. All typed motifs are automatically converted to motifs in the meme motif format. Note, however, that glam2scan does not support typed motifs because it uses a different motif format.
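The kind of processing described above can be sketched for DNA input. The code below is an illustration, not the web interface's own logic: it converts aligned sequence sites into a probability matrix, spreads each ambiguous IUPAC letter evenly over the bases it stands for (an assumption), rejects inconsistent lengths and unsupported characters, and guesses the orientation of a numeric matrix from its shape. All names are hypothetical.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def sites_to_probabilities(sites):
    """Turn aligned DNA sites into rows of [pA, pC, pG, pT], one row per position."""
    sites = [s.strip().upper() for s in sites if s.strip()]
    if not sites:
        raise ValueError("no sites given")
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("sequence sites are of inconsistent lengths")
    matrix = []
    for pos in range(width):
        counts = dict.fromkeys("ACGT", 0.0)
        for site in sites:
            letters = IUPAC.get(site[pos])
            if letters is None:
                raise ValueError(f"unsupported character {site[pos]!r}")
            for base in letters:
                counts[base] += 1.0 / len(letters)  # split ambiguous letters evenly
        matrix.append([counts[b] / len(sites) for b in "ACGT"])
    return matrix


def rows_are_positions(matrix):
    """Guess whether rows of a numeric DNA matrix are motif positions (True) or letters."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_cols == 4 and n_rows != 4:
        return True
    if n_rows == 4 and n_cols != 4:
        return False
    # Ambiguous 4x4 case: position rows of a count or probability matrix
    # should all sum to the same total.
    sums = [sum(row) for row in matrix]
    return max(sums) - min(sums) < 1e-6


if __name__ == "__main__":
    for row in sites_to_probabilities(["ACGT", "ACGA", "ACGN"]):
        print([round(v, 3) for v in row])
```

A single consensus sequence is just the one-site special case of the same conversion.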

The MEME Suite website also provides access to a large number of motif and sequence databases for use in your analyses. For example, you can select from among 38 different motif databases for use with the motif enrichment and comparison tools. These databases include in vitro compendia based on SELEX or protein-binding microarrays, as well as human-curated compendia of in vivo or in vitro motifs such as JASPAR (14). All of these databases are also available for you to download and use on your own computer via the ‘Download & Install’ menu on the MEME Suite website.

Similarly, you can select from a large menu of DNA and protein databases for use with the motif scanning tools. These include protein and genomic databases from Ensembl and GenBank, genomes from UCSC, as well as sets of promoters (upstream regions) for many organisms. To specify which sequence database you wish to search (Figure 2), you first select the database category (e.g. ‘Ensembl Ab Initio Predicted proteins’), then the organism, e.g. ‘Human’, followed by the version of the database (e.g. ‘75’).

After you submit a job to a MEME Suite tool, you will be taken to a status page showing the job's progress. The ‘Recent Jobs’ menu item on the left of most MEME Suite web-server pages will allow you to access this status page as long as your current browser session is active. If you plan to exit the browser or log off before your job completes, you should either bookmark the status page or provide an (optional) email address when you submit the job. Once your job completes, the status page will contain links to your results in HTML and other formats. Your results will be kept on our server for only a few days, so you should download them (using your browser's ‘File/Save’ feature) if you wish to save them indefinitely.

User support for the MEME Suite

User support for the MEME Suite includes extensive online documentation, an online user forum and email support. All user support is accessible via the menu on the left side of all MEME Suite web pages (e.g. Figure 2). Clicking the ‘Manual’ tab of the menu reveals links to detailed information on each of the web-based tools. Clicking the ‘OVERVIEW’ link (located at the top of the list of tools in the ‘Manual’ sub-menu) takes you to a page describing the entire suite, including the many supplementary tools for manipulating sequences and motifs available when you download and install the MEME Suite on your own computer.

Additional information is provided for some of the tools under the ‘Guides & Tutorial’ tab (e.g. Figure 2, left). The ‘Sample Outputs’ tab reveals links to example outputs from each of the web-based tools. Viewing these outputs is extremely useful for gaining an understanding of the capabilities of the individual tools and of their suitability to any particular task.

Advanced use of the MEME Suite sometimes involves creating custom motif and sequence files. Details on the relevant file formats are provided under the ‘File Format Reference’ tab of the main menu. Finally, links to the user ‘Q&A’ forum and for emailing the webmaster or developers are shown by clicking the ‘Help’ tab. The ‘Q&A’ forum is a very useful source of answers to frequently asked questions and is constantly updated.

Comparison with other motif analysis tools

Although many motif discovery and search tools have been described in the scientific literature, the MEME Suite is unique in terms of its broad functionality and consistent reporting of statistical significance for all of its outputs (Table 1). The MEME Suite provides motif discovery algorithms using both probabilistic (meme) and discrete models (DREME), which have complementary strengths. It also allows discovery of motifs with arbitrary insertions and deletions (glam2), which no other web-based tools do. Many other tool suites are DNA-only, but the MEME Suite supports motif discovery in and motif scanning of DNA, RNA and protein sequences. The meme, fimo and mcast algorithms also allow the use of sophisticated probabilistic priors for improving motif discovery and search with additional context-specific information (26–32). In addition to motif discovery, the MEME Suite provides tools for scanning sequences for matches to motifs (mast, fimo and glam2scan), scanning for clusters of motifs (mcast), comparing motifs to known motifs (Tomtom), finding preferred spacings between DNA motifs (SpaMo), predicting the biological roles of DNA motifs (gomo), measuring the positional enrichment of sequences for known DNA motifs (CentriMo), and analyzing ChIP-seq and other large DNA datasets (meme-chip). We are aware of no existing server that provides anything close to the MEME Suite's breadth of functionality.

Table 1.

MEME Suite capabilities

Capability MEME Suite (17) Consensus (18) Gibbs Sampler (19) RSAT (20) Trawler (21) W-ChIPMotifs (22) MoDTools (23) YMF 3.0 (24) XXmotif (25) 
DNA motif discovery meme ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ 
Protein motif discovery meme ✓ ✓ 
Probabilistic motif discovery meme ✓ ✓ ✓ ✓ ✓ 
Discrete motif discovery DREME ✓ ✓ ✓ 
Arbitrary gaps in motifs glam2, glam2scan ✓ 
Positionally constrained motifs meme ✓ 
Discriminative PWM motif discovery meme ✓ ✓ 
Motif scanning fimo, mcast, glam2scan ✓ ✓ ✓ ✓ 
Motif enrichment analysis CentriMo, AME, SpaMo, gomo ✓ ✓ 
Motif comparison Tomtom ✓ 
Motif cluster & spacing analysis SpaMo 
Motif functional role analysis gomo 
ChIP-seq analysis meme-c
