SFmap Manual

Input

Genome: Distinguishes between human, mouse and other genomes. For genomes other than human and mouse, the calculation is performed directly on the input sequence, without taking genomic information into account. As a result, the following restrictions apply:

  • No database assembly selection.
  • The input sequences must be loaded in FASTA format (the genomic coordinates option is not available).
  • The COS(WR) scoring function is not available (only the WR stage of the scoring function is applied).
  • The terminal positions of the input sequence ([window size]/2 bp at each end) cannot be calculated.
  • The results cannot be visualized in the UCSC genome browser.

Database assembly: Selects the version of the human genome sequence (this option is not available for other genomes). Current human genome versions:

The current mouse genome version is: Mouse July 2007 (NCBI37/mm9) assembly, produced by NCBI and the Mouse Genome Sequencing Consortium.

Input type: SFmap's mandatory input is a sequence or a list of sequences. The minimum sequence length is 21 bp and the maximum is 5,000 bp. If you have longer sequences, please divide them into segments of up to 5,000 bp.
There are two available formats for the input sequences:

  • Sequences: In FASTA format (View example).
  • Genomic Coordinates: In the format chromosome:start-end:strand. Each line should have a title, as in FASTA format (View example). This option is not available for genomes other than human and mouse.
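
As an illustration of the chromosome:start-end:strand format, the following minimal Python sketch checks a coordinate line before submission. The function name parse_coordinate is hypothetical, and the use of "+" and "-" as strand symbols is an assumption; consult the linked example for the exact syntax that SFmap accepts.

```python
import re

# Assumed layout: chromosome:start-end:strand, e.g. "chr1:1000-1500:+".
# The "+" / "-" strand symbols are an assumption, not taken from the manual.
COORD_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+):([+-])$")

def parse_coordinate(line):
    """Split one coordinate line into (chromosome, start, end, strand)."""
    match = COORD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not in chromosome:start-end:strand format: {line!r}")
    chrom, strand = match.group(1), match.group(4)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is larger than end: {line!r}")
    return chrom, start, end, strand

print(parse_coordinate("chr1:1000-1500:+"))
```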

Note: The maximum number of entries per SFmap run is 5,000. If you have more sequences, please divide them among several SFmap jobs.
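
One possible way to prepare long sequences is sketched below in Python. The 21 bp and 5,000 bp limits come from this manual; the function name split_sequence and the choice to move bases from the previous segment into a too-short final segment are assumptions made for illustration only.

```python
MIN_LEN = 21    # minimum sequence length accepted by SFmap (bp)
MAX_LEN = 5000  # maximum sequence length accepted by SFmap (bp)

def split_sequence(seq, max_len=MAX_LEN, min_len=MIN_LEN):
    """Cut seq into consecutive, non-overlapping segments of valid length."""
    if len(seq) < min_len:
        raise ValueError(f"sequence is shorter than {min_len} bp")
    segments = [seq[i:i + max_len] for i in range(0, len(seq), max_len)]
    # If the last segment is too short to submit, move bases over from
    # the end of the previous segment so that both remain valid.
    if len(segments) > 1 and len(segments[-1]) < min_len:
        need = min_len - len(segments[-1])
        segments[-1] = segments[-2][-need:] + segments[-1]
        segments[-2] = segments[-2][:-need]
    return segments

segments = split_sequence("A" * 10010)
print([len(s) for s in segments])
```

Because the segments do not overlap, a motif that spans a cut point would be split between two segments; overlapping segments may be preferable when this matters.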

Motifs selection

By default, SFmap searches for a list of binding sites of known splicing factors. (View list)

  • Custom motif selection: Lets you select specific motifs of interest from SFmap's list. It is possible to clear all pre-defined motifs and search for user-defined motifs only.
  • User-defined motifs: Lets you add your own motifs to SFmap's search. The number of added motifs is not limited; you can add as many as you wish by clicking the 'Add another motif' button. To remove the additional motif lines, press F5 or use the browser's refresh button.
    The motifs should be 4-10 bp long and contain IUPAC symbols only.
    The default factor name is 'user<N>'; you can replace it with your own factor name.
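
The constraints on user-defined motifs can be checked before submission with a short Python sketch such as the one below. The names validate_motif and motif_to_regex are hypothetical, and treating U as equivalent to T is an assumption. The regular expression only shows which concrete k-mers an IUPAC motif stands for; it is not how SFmap scores hits, since SFmap does not require an exact match.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
# Mapping U to T is an assumption made for DNA input.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def validate_motif(motif):
    """Return the upper-cased motif, or raise ValueError if it is invalid."""
    motif = motif.upper()
    if not 4 <= len(motif) <= 10:
        raise ValueError("motif must be 4-10 bp long")
    unknown = sorted(set(motif) - set(IUPAC))
    if unknown:
        raise ValueError(f"non-IUPAC symbols: {unknown}")
    return motif

def motif_to_regex(motif):
    """Expand each IUPAC symbol into a character class."""
    return re.compile("".join(f"[{IUPAC[c]}]" for c in validate_motif(motif)))

print(motif_to_regex("YGCY").pattern)
```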

Advanced options

Scoring Function:

  • COS(WR): This is the recommended option for human/mouse sequences. The calculation is done in two steps: First, the Weighted Rank (WR) function is applied to calculate the clustering propensity of the motif. Second, the evolutionary conservation is estimated by further weighting the WR scores with the Conservation Of Score (COS) function. The COS function is applied to the human-mouse alignment of the input sequence; therefore, only positions covered by the alignment can be calculated.
    Note: SFmap uses BLAT to determine the genomic coordinates. The coordinates are used to retrieve the human-mouse alignment, as well as to extend the input sequence by [window size]/2 so that the terminal positions can be calculated. If BLAT cannot find a match for the input sequence, or no human-mouse alignment is found, only the WR stage is applied and the terminal positions of the sequence are not calculated.
  • WR: With this option, only the WR step of the calculation is applied and no human-mouse alignment is required.
    This option is suitable for sequences from genomes other than human and mouse, or when the human-mouse alignment is partial and does not cover most of the input sequence.

Window size: The size of the window used to calculate the multiplicity; the default is 50 bp.
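
When the input sequence cannot be extended (for example, for genomes other than human and mouse), the [window size]/2 terminal positions are not calculated. The Python sketch below estimates which positions remain under that reading. The function name scorable_range, the 0-based half-open convention and the rounding of odd window sizes are assumptions, so the exact boundaries used by SFmap may differ.

```python
def scorable_range(seq_len, window_size=50):
    """Return (start, end), 0-based and half-open, of the positions left
    after dropping window_size // 2 positions at each end of the sequence.
    Returns None if no position is left under this reading."""
    flank = window_size // 2
    start, end = flank, seq_len - flank
    return (start, end) if start < end else None

print(scorable_range(200))
```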

Threshold: Two thresholds (significant and suboptimal) are used to calculate the WR function. This option controls the stringency level. Stringency options:

  • Medium stringency (default): Threshold[significant] at p-value <0.005; Threshold[suboptimal] at p-value <0.05.
  • Low stringency: Threshold[significant] at p-value <0.01; Threshold[suboptimal] at p-value <0.05.
  • High stringency: Threshold[significant] at p-value <0.005; Threshold[suboptimal] at p-value <0.01.
  • Exact match: Threshold[significant]: include all exact match hits instead of using p-value; Threshold[suboptimal] at p-value <0.05.
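
The four stringency options can be summarized as a small lookup table. The Python sketch below copies the cutoffs from the list above; the names STRINGENCY and classify_hit are hypothetical, and the sketch only assigns a hit to a threshold class. It does not reproduce how SFmap combines the two classes in the WR function.

```python
# p-value cutoffs per stringency option, copied from the manual.
# None means "use exact-match hits instead of a p-value".
STRINGENCY = {
    "low":    {"significant": 0.01,  "suboptimal": 0.05},
    "medium": {"significant": 0.005, "suboptimal": 0.05},
    "high":   {"significant": 0.005, "suboptimal": 0.01},
    "exact":  {"significant": None,  "suboptimal": 0.05},
}

def classify_hit(p_value, level="medium", exact_match=False):
    """Return 'significant', 'suboptimal' or None for a single hit."""
    cutoffs = STRINGENCY[level]
    significant = cutoffs["significant"]
    if significant is None:
        if exact_match:
            return "significant"
    elif p_value < significant:
        return "significant"
    if p_value < cutoffs["suboptimal"]:
        return "suboptimal"
    return None

print(classify_hit(0.007, "medium"))
```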

For more information about the SFmap algorithm and the calculation parameters, see equation 1 and figure 1 in Akerman et al.

General options

E-mail address: An optional field. Fill it in if you want to receive a link to the results page by E-mail. If you don't get an E-mail from SFmap within a reasonable time, check your spam folder, as the message may have been filtered there by mistake. Note that large jobs may take more than 24 hours.

Job name: An optional parameter that lets you give your job an informative name. If it is left empty, the job gets a unique numeric identifier.

Results

The SFmap results page provides two links for each input sequence: (View results page example)

  1. The motifs prediction summary text file. (View summary example)
  2. A visualized display of the motifs using the UCSC Genome Browser. (View Genome Browser example)

When there is more than one input sequence, an additional file summarizing the motif predictions for all the input sequences together is also provided. (View all sequences summary example)

Note: SFmap uses BLAT to determine the input sequence coordinates in the human and mouse genomes. If BLAT does not find a match with at least 95% identity for the input sequence, SFmap results cannot be displayed in the Genome Browser.

1. Motifs prediction summary file

This is a summary table of the predicted binding sites and their locations on the input sequence. Each motif is shown separately with its cutoff, which is a theoretically calculated threshold. Only hits with scores above the motif's specific cutoff are displayed. The hits are ordered by their sequence position and contain the following information:
Sequence Position: The starting position of the motif in the input sequence.
Genomic Coordinate: The genomic coordinate at which the motif starts.
K-mer: The predicted motif that is found in the input sequence.
Score: The motif's similarity score calculated by the COS(WR) algorithm.
Since the COS(WR) algorithm does not require 100% similarity to the known motif, two different motifs can be predicted in the same position.

It is recommended to save this file and open it with a spreadsheet program such as Microsoft Excel.

2. UCSC Genome Browser visualization

The Genome Browser visualizes the predicted motif binding sites on the input sequence. The hits are displayed at their starting position in the sequence. Each track (line) represents a motif that is predicted to appear at least once in the sequence. Motifs that belong to the same splicing factor are grouped together and displayed in the same color.
Note: The Genome Browser displays only the sense strand. If your sequence is on the antisense strand, read the Genome Browser results from right to left.

Default presentation: Displays three kinds of tracks on the complete sequence:

  1. The custom tracks of the predicted motifs in dense mode, where the different scores are represented by different shades of gray (a higher score is darker).
  2. An alignment track that shows the coverage of the input sequence by the human-mouse alignment (in dark gray). This track is displayed only when the COS(WR) algorithm was applied. Positions that are not covered by the alignment were not calculated.
  3. The 'UCSC Genes' track in full mode.

     View default presentation example

Other useful display options:

  • Display all or some of the motif tracks in full visibility mode: Click on the track name (to the left of the sequence). You can also go to the 'Custom Tracks' section, select 'full' in the desired tracks and click the 'refresh' button. In full visibility mode, the scores are shown as a bar diagram. The horizontal black line represents the specific cutoff of the motif. (View full visibility mode example)
  • Hide custom tracks: Go to the 'Custom Tracks' section, select 'hide' in the desired tracks and click the 'refresh' button. (View hide tracks example)
  • Sequence navigation: Zoom in or out on the sequence, or move upstream or downstream, using the buttons at the top of the Genome Browser. For example: zoom in to 'base' resolution and move right (downstream).
    (View sequence navigation example)
  • Add an annotated track: For example, go to the 'Comparative Genomics' section, select 'full' in the 'Conservation' track and click the 'refresh' button. The 'Conservation' track will be added and displayed in full visibility mode.
    (View add annotated tracks example)

For more information about the UCSC Genome Browser options, go to the Genome Browser User Guide.

This web site is supported by Eliyahu Pen Research Fund